Now that key definitions have been provided, it is time to outline the steps involved in constructing a scale. We will begin with “topic selection”—a process that can receive short shrift from those with limited psychometric training.
4.2. Generating Scale Items
Earlier, we mentioned that “bigger is not necessarily better” when it comes to Cronbach’s alpha coefficient. However, with respect to your preliminary pool of items, your personal mantra should be “bigger is better”. You want to create a large pool of items because many of those items will be eliminated when they are assessed by experts (i.e., the items go through the process of content validation).
We each learned this lesson after publishing various papers. For example, TGM co-authored a paper describing the construction of a measure entitled The Anti-Fat Attitudes Scale (AFAS) [28]. In this article, he detailed concerns with the principal instrument used to measure this construct. However, what he and his co-author failed to do was specify the parameters of the construct and develop a large pool of items to represent those parameters. Instead, they started—and finished—with five items. The authors did not seek out content experts to evaluate those items, nor did they pilot test them with “lay experts” (i.e., overweight individuals) to determine if they felt items on the AFAS reflected common prejudices directed at persons perceived to be overweight. The value of the AFAS was compromised because the authors did not test its content representativeness. It is possible that the totality of anti-fat prejudice can be captured by five items; possible, but unlikely. Similarly, CJB published an article examining heterosexual participants’ homonegative reactions in response to same-sex dyad stimuli and an interactive gay male target. His goal was to control for the effects of participants’ attitudes toward public displays of affection (PDAs) given the physical intimacy displayed by the dyadic targets. At the time of data collection, a relevant measure of attitudes toward PDAs was not available, so one needed to be created. The author created four items that were neither evaluated by content experts nor pilot-tested [29].
It is important that the items you generate are rooted within the three pillars critical to content validity: (1) the relevant literature; (2) input from relevant stakeholders; and (3) input from content experts [30]. It is not uncommon for researchers to create items they think map onto their topic of interest without paying sufficient attention to the body of literature that exists about that construct. Even in situations where a researcher reviews a handful of relevant articles, there is still the risk of overlooking important aspects of the construct that may not be reflected in the subset of papers reviewed. While a fully systematic review may be an unreasonable expectation, researchers should strive to include as much of the relevant literature as possible during the item generation process.
To assist with creating scale items, we recommend that, first, you consult the relevant literature. For instance, if you wanted to develop a measure examining attitudes toward same-sex fathers, a good place to start would be reviewing research on this topic (ideally, articles published within the last five years). At this stage, both qualitative and quantitative studies should be used. For instance, suppose you read a qualitative study in which a group of gay fathers reported that others had expressed concerns about the fathers’ ability to provide both male and female role models. You might develop several questions to reflect this concern (e.g., “Gay fathers do not provide their child/children with female role models”). In your review of the literature, you also might encounter scales that, while flawed, possess a sprinkling of items that you think warrant consideration. Please note that if you wish to add these items to your pool, you must contact the author(s) in question to obtain permission to do so.
Finally, if you are measuring a novel construct, there may be little published research to draw upon. In such cases, you must rely on more anecdotal sources of inspiration such as conversations with friends/colleagues, self-reflection about the topic, and social media discussions.
We also recommend supplementing a review of the literature with a qualitative component. Focus groups with relevant stakeholders, for instance, can reveal additional facets of the topic that, for whatever reason, may not be reflected in the literature. Imagine that you are creating a measure of attitudes toward two-mother (e.g., lesbian) parenting. It is important to complete a deep dive into the relevant literature pertaining to this topic. However, consulting with women who parent with a same-sex partner can offer additional crucial information that may not be captured by the published research. The inclusion of content experts is discussed in greater detail below.
During initial item development, do not agonize over wording and issues such as clarity and representativeness. Simply create your items! Later, you will have time to determine if item 27 is double-barreled or if item 5 is too wordy. Also, at this juncture, do not be concerned about content overlap across items. Let your mantra be the following: generate, generate, generate!
Based on our experience, we recommend that scales be 10 to 20 items in length. Fewer items can lead to construct under-representation; more than 20, and respondent fatigue may become a concern. Given these endpoints (minimum 10 items; maximum 20 items), your initial pool should comprise no fewer than 50 to 100 items, denoting a conservative retention rate of 20%. Two points are worth noting here. First, as with all psychometric primers, the guidelines we provide reflect our opinions. They are not incontrovertible facts. Second, scale development is not akin to following a recipe. Thus, the proportions we offer should be viewed as rough guidelines. For instance, there may be times when a researcher generates an initial pool of 100 or more items and there may be times when 30 items is sufficient. When it comes to creating a scale, common sense should prevail rather than dogmatic adherence to advice from experts.
In the past, it was common for scale developers to generate a mixture of items that were positively and negatively keyed. For example, in [13]’s Modern Homonegativity Scale, most of the items are keyed (i.e., scored) in one direction (e.g., “Gay men should stop complaining about the way they are treated in society, and simply get on with their lives”). For these items, agreement denotes stronger endorsement of modern homonegativity. However, a few items are reverse keyed (e.g., “Gay men still need to protest for equal rights”). With these items, agreement reflects weaker endorsement of modern homonegativity. When you have a mixture of positively and negatively keyed items, you must rescore the negatively keyed items so that, prior to analysis, all items are scored in the same direction. If you fail to do this, the total score on your measure will be unintelligible. Additionally, you might end up with a negative Cronbach’s alpha coefficient. A negative Cronbach’s alpha coefficient tells you that something is “off” about your pool of items. Typically, the culprit is failing to rescore items that are negatively keyed. Inspecting the output entitled “corrected item-total correlation” helps to identify any items that are problematic (i.e., these items will have substantial negative correlations with the sum of all other items). The one thing researchers should not do is ignore the fact that a negative Cronbach’s alpha has been obtained.
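To make the rescoring step concrete, here is a minimal sketch in Python (pandas); the data frame df, the 1–5 response range, and the reverse-keyed item names are hypothetical stand-ins for your own data.

```python
import pandas as pd

def reverse_key(df, items, scale_min=1, scale_max=5):
    """Rescore negatively keyed items so all items run in the same direction."""
    out = df.copy()
    out[items] = (scale_min + scale_max) - out[items]
    return out

def cronbach_alpha(items_df):
    """Cronbach's alpha for a respondents-by-items data frame."""
    k = items_df.shape[1]
    item_variances = items_df.var(axis=0, ddof=1).sum()
    total_variance = items_df.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical usage: "mhs_04" and "mhs_07" are the reverse-keyed items.
# df = reverse_key(df, ["mhs_04", "mhs_07"])
# print(cronbach_alpha(df))  # a negative value signals un-rescored items
```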
Using a mixture of positively and negatively keyed items was viewed as sound psychometric practice because, at the time, it was seen as a way of helping researchers identify participants who were not paying attention. Specifically, if someone selected “strongly agree” to both “Gay men should stop complaining about the way they are treated in society, and simply get on with their lives” and “Gay men still need to protest for equal rights”, this response pattern would suggest that the participant was not reading the attitude statements carefully.
However, most psychometricians now agree that using items keyed (scored) in different directions is problematic (see, for example, ref. [31]). First, it can be difficult to create items in which strong disagreement and strong agreement reflect the same attitudinal position. Second, negatively keyed items often rely on the use of negation (e.g., “do not”, “will not”, etc.), which can be confusing to participants. Increases in confusion heighten random measurement error, which, in turn, shrinks scale score reliability and the magnitude of correlations in general. For these reasons, we recommend that all scale items be keyed in the same direction such that higher scores reflect more of the construct.
To address the issue of careless responding, we routinely embed one to two attention checks into the pool of items being tested. For example, “For this question, please select ‘agree’ as your response”. Any respondent who does not follow this instruction (i.e., selects an option other than “agree”) would be seen as providing questionable data. Researchers then should flag those participants who failed to answer the attention check items correctly. Their data should be retained but not used when the researcher conducts further psychometric testing.
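A short sketch of how such flagging might be automated (Python/pandas); the column names and required response are hypothetical.

```python
import pandas as pd

def flag_attention_checks(df, checks):
    """Return a Boolean Series marking respondents who missed any attention check.

    `checks` maps each attention-check column to its required response,
    e.g., {"check_1": "agree"} (hypothetical column name).
    """
    failed = pd.Series(False, index=df.index)
    for col, required in checks.items():
        failed |= df[col] != required
    return failed

# df["failed_check"] = flag_attention_checks(df, {"check_1": "agree"})
# analysis_df = df[~df["failed_check"]]  # retain flagged cases in the raw file,
#                                        # but exclude them from psychometric testing
```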
When creating scale items, you want them to be accessible and straightforward, and to reflect a single idea (i.e., double-barreled items must be avoided). Readability software, of your choice, should be used to assess items’ accessibility and straightforwardness (i.e., aim for the reading level of an average 11- to 12-year-old). Do not rely upon your own assessment of items’ readability, nor the assessments of undergraduate/graduate students or members of your research lab/team. Always err on the side of making items simpler rather than more complex. If participants do not understand the words that appear on a scale, they may try to decipher what those words mean. If they make an erroneous guess (e.g., assume that the word “augment” means to decrease or the word “deleterious” means helpful), the value of your data has been compromised.
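One hedged way to automate this screen is with the third-party textstat package (readability formulas are noisy for single short sentences, so treat the output as a rough guide); the items shown are illustrative.

```python
# pip install textstat
import textstat

draft_items = [
    "Gay fathers do not provide their child/children with female role models.",
    "Two-mother parents are able to show warmth to their child/children.",
]

# The 11- to 12-year-old target corresponds roughly to a grade level of about 6-7.
for item in draft_items:
    grade = textstat.flesch_kincaid_grade(item)
    if grade > 7:
        print(f"Consider simplifying (grade {grade:.1f}): {item}")
```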
Researchers should avoid using slang expressions, as these can impose temporal limitations on the value of a scale. Slang that is popular now may be incomprehensible in a few years (e.g., today, younger gay males would likely recognize the term “straight acting” while being unaware of the now anachronistic term “Castro clone”). Idioms (e.g., “bite the bullet” or “beat around the bush”) and proverbs (e.g., “birds of a feather, flock together”) also should be avoided. Finally, researchers should consider carefully whether they wish to include coarse language or pejoratives that may be deemed offensive by some respondents. Blanket condemnation of such terminology is inappropriate; in some cases, slurs or other derogatory terms may be seen as critical elements of a scale. For instance, if we wanted to assess gay men’s experiences of verbal discrimination, we may feel justified in creating an item such as “In the past 4 weeks, I have been called a ‘fag’”. However, we should not be surprised if ethics boards express reservations about the use of this type of item nor should we be surprised if (some) respondents complain about the item’s content.
You also want to ensure that the response format you adopt “fits” with the items. If, for example, you are interested in determining how often respondents experience an event, then “strongly agree” to “strongly disagree” would be inappropriate. Instead, a “never” to “always” response format would make more sense. In addition, beware of crafting items that suggest temporality (e.g., “sometimes”, “rarely”, “usually”, etc.). Employing temporal words can render your items difficult to interpret. Take, for example, the following item: “Sometimes, I get upset when I think about how trans people are portrayed in mainstream media”. If the response format is “never”, “rarely”, “sometimes”, “often”, and “always”, then how does a participant make an intelligible response? Selecting “often” for this item would result in “I ‘often’ sometimes get upset about how trans people are portrayed in mainstream media”.
We recommend using a 5- or 7-point Likert-type response format (see [32]). Having too few response options may lead to insufficient variability; having too many may confuse participants. Remember: people tend not to think in very granular terms; consequently, they may be unable to differentiate—in a meaningful way—among options such as “slightly agree”, “agree”, and “strongly agree”. Notice, too, that we recommend using an odd number of response options; doing so mitigates forced choice responding in which participants are required to “possess” an attitude about your scale items. A useful rule of thumb is that Likert scales with an even number of response options likely reflect forced choice scenarios. On the other hand, Likert scales with an odd number of response options likely have a neutral position built in.
The following response format is forced choice: strongly agree, agree, disagree, and strongly disagree (i.e., four response options). In this case, participants must select an option that denotes a particular position (agreement or disagreement). However, what happens if respondents are uncertain about their attitude or have not yet formed a specific attitude in relation to the item in question? Non-attitudes may surface; that is, participants may “agree” or “disagree” with a scale item but only because they do not have the option of selecting “no opinion” or “don’t know” (see [33]). Forced choice responding also may elicit satisficing behaviors whereby participants select the first plausible response option rather than the most suitable one (i.e., “None of these options really applies to me, but this one seems like it could reflect my belief/opinion”). To avoid non-attitudes and/or satisficing, we recommend that researchers include options such as “don’t know”, “not applicable”, or “neither agree nor disagree”. These options may appear as the midpoint or be offered as an adjacent option (i.e., their placement will depend on the scale items and the response format that is used). Respondents may be reluctant to select “no opinion” or “don’t know” as a response option because doing so may have implications for their self-esteem [33]. We recommend that researchers “normalize” this process by emphasizing its acceptability in the instructions at the start of the survey.
We do not recommend using dichotomous response formats (e.g., “yes” versus “no”, “agree” versus “disagree”, and “true” versus “false”). They are reductionistic. Also, many researchers are unaware that binary data require more complex statistical treatment. (Technically, Likert scales constitute an ordinal form of measurement; however, it is common practice for Likert scales to be treated as providing interval-type data (see [34]).)
When it comes to response options, clarity is key. In the courses we teach, we routinely ask students to operationalize terms such as “sometimes” or “fairly often”. We are astounded at the variability that emerges—even among individuals who are quite homogeneous (i.e., they fit the profile of “typical” university students enrolled in psychology courses). To some, “fairly often” is “3 times per week” whereas to others, “fairly often” suggests an event occurring 4 to 5 times per month. Similar variation is noted for “sometimes”, “seldom”, and “rarely”. To maximize clarity, we recommend that researchers add numeric qualifiers, where appropriate: never (0 times per week); rarely (at most, once per month); sometimes (at most, 2–3 times per month); and so on. This type of specificity ensures that respondents are operating from the same temporal perspective.
Please note that, depending on your research goals, an alternative to Likert response options may be sought. For example, a semantic differential (SD) scale would offer participants the same number of response options as a Likert-type scale. The key difference between these two scale types is that SD scales are anchored using “bipolar” adjectives (e.g., “bad” to “good” or “cold” to “hot”) while Likert scales assess levels of agreement (or disagreement) with each item. SD scales may be useful in assessing participants’ evaluations of specific constructs. However, be aware that using “bipolar” adjectives may carry some limitations. Consider a 5-point semantic differential scale assessing perceptions of transgender women from “cold” to “warm”. “Cold” would constitute the first of the five options with “warm” the fifth. Concerns arise when one considers that options 2, 3, and 4 do not possess textual indicators given that they reflect points between the poles. Researchers are unable to determine how participants are conceptualizing the differences between each numerical option. These differences may have implications for interpretation of results. Another option is the Bogardus social distance (BSD) scale. A BSD scale measures participants’ affect (e.g., intimacy, hostility, and warmth) toward outgroup members. BSD scales also use 5- or 7-point response options. For example, using a hypothetical 5-point response option along with our example from above, “1” would reflect desiring no social distance from transgender women while “5” would indicate desiring maximum social distance. As with SD scales, there may be discrepancies between participants’ interpretations of the options between desiring “no social distance” versus “maximal distance” from transgender women. The key difference between the two is the semantics of each item: Likert scales employ statements related to the construct of interest while BSD scales more commonly rely upon questions.
4.3. Testing Content Validity
Consider our previous example involving the development of a measure of attitudes toward two-mother parenting. Imagine that you have now created a pool of 100 items. You have reviewed this pool and are satisfied with the quality of the items. They seem to be measuring the construct of interest; namely, attitudes toward two-mother parenting. However, your faith in the caliber of these items is insufficient. They must be evaluated formally.
Given the number of items that you have created, we recommend that you target four to five content experts. This number will ensure that experts are providing granular assessments of approximately 20 to 25 items each. You want to avoid tasking content experts with reviewing a full pool of items. Why? First, because it is unlikely they will do a thorough job (i.e., the experts may be more critical of the earlier items, and more “lenient” with the later items). Second, there is a risk that content experts might begin their review and then stop because they do not have the time required to assess 100+ questions. Third, if the review process is too labor-intensive, it might be difficult to recruit content experts (i.e., few will have the time needed to assess a pool of items containing 100+ items).
Prior to approaching your content experts, it is critical that you meticulously proofread the items to ensure that your pool contains as few grammatical and spelling errors as possible. Having multiple “pairs of eyes” involved in this process is invaluable.
The experts should be individuals with a background in psychometrics and/or expertise in the topic you are examining. In this hypothetical example, you would recruit experts who have published on the topic of same-sex parenting (ideally, two-mother parenting). A developmental psychologist, who is known for their parenting research, but who has limited experience with SGMPs, would not constitute a suitable expert. Each expert would review their subset of items on various dimensions including item clarity, item representativeness, and item quality. Given that assisting with this process is voluntary and time-consuming, we recommend that a small honorarium (e.g., a $10 gift card to a merchant of their choice) be provided.
For each item, a mean rating would be calculated per dimension. Assume that, for item clarity, a 3-point scale was used: 1 = item is not clear; 2 = item needs revision to be clear; and 3 = item is clear. In this context, item clarity is subjective and based on the perception of the experts you approach. In addition to implementing a 3-point scale, you may also wish to request, time permitting, brief comments regarding why the experts believe specific items require revision (i.e., items receiving a score of 2) (see [35] for more details). Any item that has an average score < 2 would be removed from further consideration (i.e., an item regarded by experts as unclear would be eliminated). The outcome of this evaluative process will vary depending on the perceived quality of the items. For example, if most of the items are viewed as clear and representative, then—at the content validity stage—the pool will not have diminished appreciably (Fear not. We will discuss additional techniques that are useful in winnowing pools of items.). If, however, most of the items are perceived as unclear and non-representative of the construct of interest, then you will have to spend time revising and generating new items. The resultant pool then will need to be reassessed by content experts. Ideally, the individuals who evaluated the previous pool also will examine the revised one.
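As a sketch of how the expert ratings might be summarized (Python/pandas), assuming a long-format file with one row per expert-by-item judgment; the item and expert labels are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: clarity scored 1 (not clear), 2 (needs revision), 3 (clear).
ratings = pd.DataFrame({
    "item":    ["item_01", "item_01", "item_02", "item_02"],
    "expert":  ["A", "B", "A", "B"],
    "clarity": [3, 3, 1, 2],
})

mean_clarity = ratings.groupby("item")["clarity"].mean()
drop    = mean_clarity[mean_clarity < 2].index.tolist()                           # eliminate as unclear
revisit = mean_clarity[(mean_clarity >= 2) & (mean_clarity < 3)].index.tolist()   # revise and re-review

print("Drop:", drop)        # ['item_02'] in this toy example
print("Revisit:", revisit)  # []
```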
4.4. Quantitative Assessment of Item Integrity
Let us suppose that you have received feedback from content experts about your pool of items on same-sex parenting. A total of 50 of your 100 items were removed (i.e., 32 were viewed as unclear and 18 were seen as unrepresentative). An additional 15 items were seen as requiring revision. Given the size of your item pool, you elect to drop 10 of these items and revise the remaining 5. Thus, you now have a 40-item pool. You distribute these items to your content experts a second time. The feedback is positive (i.e., all items are seen as clear, representative, etc.). Does this mean you now are ready to distribute your 40-item measure to participants? Unfortunately, the answer is no.
You still need to gauge whether any of the retained items are redundant; have limited variability (i.e., only one or two response options are selected by most participants); or are biased (i.e., responses to the items correlate strongly with responses on indices measuring social desirability). To assess such concerns, you need to distribute your item pool to a small sample of participants (e.g., 50 to 75 individuals). One caveat: this test sample should resemble the type of sample you ultimately intend to use. Thus, in the case of our hypothetical same-sex parenting measure, if your goal is to distribute this scale to college/university students, then your test sample should also comprise post-secondary respondents.
Participants would receive the test pool, which in this example is 40 items, and a psychometrically sound measure of social desirability bias (We recommend the Social Desirability Scale-17 [36].). A small number of demographic questions also should be included for diagnostic purposes. For example, you might find that scale score reliability coefficients are satisfactory for self-identified cisgender women, but not for cisgender men. Of course, your ability to conduct these sorts of tests will depend on the size of the test sample (e.g., if you only have five cisgender men, you will be unable to conduct any sort of psychometric testing with this group).
After the data have been collected, you should compute frequencies for each item in the test pool. The goal is to remove items that possess insufficient variability. In our own scale development practice, we have relied upon [37]’s guidelines (We reiterate that these are guidelines and not edicts.). First, you could remove any item in which more than 50 percent of responses fall into one response option. For instance, if 62% of respondents select “strongly agree” for question X, we recommend eliminating that item because it is providing insufficient variability (i.e., you are observing restriction of range, which has various statistical implications). Second, you could remove any item in which the two response options at one end of the scale, when combined, total less than 10 percent. To illustrate, assume that 3% of respondents “strongly agree” and 5% “agree” with question X. That question should be eliminated because only 8% of participants agree with its content. Third, ref. [37] contends that the majority (i.e., >50%) of your response options should have a minimum endorsement rate of 10%. For instance, assume that you have reviewed the frequencies for the following item: “Two-mother parents are able to show warmth to their child/children”. You obtain the following values: 45% = strongly agree; 21% = agree; 3% = slightly agree; 9% = neither agree nor disagree; 8% = slightly disagree; 9% = disagree; 5% = strongly disagree. You might opt to remove this item from your pool because most of your response options (i.e., 5 out of 7) do not have a minimum endorsement rate of 10%.
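The three guidelines can be scripted so that they are applied consistently across the pool. The sketch below (Python/pandas) assumes a 7-point agree–disagree format and reads the second guideline as referring to the two agreement options, matching the worked example; adjust both assumptions to your own response format.

```python
import pandas as pd

LIKERT_7 = ["strongly disagree", "disagree", "slightly disagree",
            "neither agree nor disagree",
            "slightly agree", "agree", "strongly agree"]

def variability_flags(responses, options=LIKERT_7):
    """Return the frequency-based guidelines violated by one item's responses."""
    pct = responses.value_counts(normalize=True).reindex(options, fill_value=0)
    flags = []
    if (pct > 0.50).any():
        flags.append("guideline 1: one option captures > 50%")
    if pct["agree"] + pct["strongly agree"] < 0.10:
        flags.append("guideline 2: combined agreement < 10%")
    if (pct >= 0.10).sum() <= len(options) / 2:
        flags.append("guideline 3: too few options reach 10%")
    return flags

# for col in item_columns:                      # hypothetical list of item names
#     print(col, variability_flags(df[col]))
```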
The application of [37]’s criteria should result in the removal of a number of items from your pool. Be warned: if the number of items removed is high, you may need to create additional items. You do not want a situation in which the pool of retained items is too small (i.e., <10).
Let us assume that, from your initial pool of 40 items, 13 were removed following the use of [37]’s guidelines. You now have 27 items, which is close to your targeted maximum length (i.e., 20 items); however, we will apply one additional assessment to remove any further items that are suboptimal.
When computing Cronbach’s alpha coefficient, there is output that is helpful in eliminating items. We recommend focusing on two pieces of information. The first is called “Cronbach’s alpha if item deleted”. As you might expect, this tells you what your scale score reliability coefficient (i.e., Cronbach’s alpha) would be, if you removed the item in question from your scale. If the removal of an item does not produce a noticeable decline in Cronbach’s alpha, then the value of that item may be questioned. We recommend using a cut-off of 0.03. Specifically, if eliminating an item decreases Cronbach’s alpha coefficient by 0.03 (or greater), the item should be retained. However, if the decrease is <0.03, the item is a candidate for removal. Should the removal of an item substantially increase Cronbach’s alpha coefficient, this is a sound indicator that the item should be eliminated.
The second piece of output that is useful is called “corrected item–total correlation”. This output refers to the correlation between responses to the item in question and the sum of all remaining items. So, a corrected item–total correlation of 0.52 for item 1 represents the correlation between responses on item 1 and the sum of all other items (e.g., items 2 through 27 in the hypothetical attitudes toward same-sex mothering instrument). You want all corrected item–total correlations to be positive. Negative correlations suggest that higher scores on a given item are associated with lower total scores on all remaining items, a result that, assuming all items are scored in the same direction, does not make conceptual sense. Additionally, based on our experience, we recommend that correlations fall between 0.40 and 0.60. Lower correlations (i.e., <0.40) suggest that the item may not link conceptually with the other items; higher correlations (i.e., >0.60) reveal possible redundancy (i.e., you have multiple items that might be assessing the same sliver of content). To summarize, items that correlate negatively with item–total scores or have correlation coefficients < 0.40 or >0.60 are candidates for removal.
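Both diagnostics are easy to tabulate by hand if your software does not report them. The sketch below (Python/pandas) reuses the cronbach_alpha() helper from the earlier reverse-keying example; df and item_columns remain hypothetical placeholders for your data.

```python
import pandas as pd

def item_diagnostics(items_df):
    """Alpha-if-item-deleted and corrected item-total correlation for each item."""
    rows = []
    for col in items_df.columns:
        remaining = items_df.drop(columns=col)
        rows.append({
            "item": col,
            "alpha_if_deleted": cronbach_alpha(remaining),
            "corrected_item_total_r": items_df[col].corr(remaining.sum(axis=1)),
        })
    return pd.DataFrame(rows)

# diag = item_diagnostics(df[item_columns])
# full_alpha = cronbach_alpha(df[item_columns])
# candidates = diag[
#     (full_alpha - diag["alpha_if_deleted"] < 0.03)      # removal barely lowers alpha
#     | (diag["corrected_item_total_r"] < 0.40)
#     | (diag["corrected_item_total_r"] > 0.60)
# ]
```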
Another strategy that can be used to identify and remove weak items involves correlating the scores for each item with the total score on a measure of social desirability bias. In our own practice, we use a cut-off of 0.33 (i.e., any item that correlates at 0.33 or higher, suggesting approximately 10%+ shared variance, with a measure of social desirability bias should be removed). Given that pilot test samples are (typically) small, the magnitude of the correlation is more important than its statistical significance.
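A corresponding sketch for the social desirability screen (Python/pandas); the sd_total column name is hypothetical and would hold each respondent’s Social Desirability Scale-17 total.

```python
import pandas as pd

def socially_desirable_items(items_df, sd_total, cutoff=0.33):
    """Items whose responses correlate at |r| >= cutoff (roughly 10%+ shared
    variance) with a social desirability total score."""
    corrs = items_df.corrwith(sd_total)
    return corrs[corrs.abs() >= cutoff]

# flagged = socially_desirable_items(df[item_columns], df["sd_total"])
```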
The quantitative assessment of item integrity involves applying numeric guidelines to identify items that should be deleted from an item pool. However, you want to avoid blind adherence to these sorts of recommendations. Refining your pool of items is an intricate dance between maximizing the quality of the items, reducing the size of the item pool to a manageable number, and ensuring sufficient content representativeness. In practice, there may be times when items are retained because you think they assess important facets of the construct—even though they fall outside the benchmarks we have listed. Let common sense prevail: if you believe that an item is important, at this stage, keep it. Once you have refined your item pool, you will want to assess its factorial validity.
4.5. Testing Factorial Validity
Factorial validity is not laden with various subtypes. However, generating evidence for this form of validity is arguably the most complex task. Adding to this complexity, several statistical methods are available, and they should be applied in a specific order. The first step in assessing factorial validity involves exploratory factor analysis (EFA). EFA allows us to identify how many factors our new measure comprises and how well the items load on each factor. In other words, we are constructing a theory regarding how the items in our new measure can be organized. EFA is an intricate technique that requires researchers to make a series of well-informed decisions regarding how the data should be treated. Ill-informed decisions may produce results that call into question the utility of one’s new measure. With EFA, we must be cognizant of (1) sample size; (2) extraction methods; (3) rotation; and (4) extraction decisions (i.e., how many factors to retain).
As noted earlier, a general rule of thumb is that a minimum of 200 participants should be recruited under “moderate conditions” [25]. In this case, “moderate conditions” merely refers to situations where three or more items load onto each factor and communalities fall between 0.4 and 0.7 (Communalities refer to the proportion of an item’s variance that is accounted for by the common factors.). If you have recruited the number of participants suggested above, their data can be used for this analysis.
Extraction methods fall under two distinct models: “common factor” (CF) and “principal component analysis” (PCA). A common knowledge gap concerns the fact that each model serves a different purpose; the two cannot be applied interchangeably. In other words, only methods falling under the CF model can execute an EFA. For context, the PCA model is only suitable for situations where researchers are concerned with item reduction; it cannot and does not consider how individual items load onto particular factors (which is exactly what we want!). Unfortunately, the PCA model is regularly reported as the extraction method used to assess factorial validity, particularly among scales relevant to LGBTQ2S+ persons (e.g., see scales reviewed by [3,6,7,8,9,10]). This may be because it is the default setting in many common statistics packages (e.g., SPSS).
To reiterate, ensure that you choose methods rooted in the CF model, of which there are several. One recommendation is the maximum likelihood (ML) method. ML is an excellent choice because it is capable of handling data that moderately violate normality assumptions. For situations where one’s data severely violate normality assumptions, the principal axis factoring (PAF) method is recommended as an alternative strategy. While it may seem logical to always use PAF “just in case”, ML offers certain benefits over PAF. For example, ref. [38] note that ML can produce more robust goodness-of-fit indices compared to PAF.
The term “rotation” refers to how your statistics program will situate the axes (i.e., each of your factors) to better coincide (i.e., “fit”) with your individual data points, effectively simplifying the data structure. There are two main forms of rotation: orthogonal and oblique. Orthogonal rotation yields factors that are uncorrelated, which is problematic in the context of SGMPs since most constructs of interest will yield factors that are related to some degree. As a result, the use of orthogonal rotation will not help us determine simple structure. Oblique rotation, on the other hand, allows, but does not require, factors to be correlated. Given this enhanced flexibility, oblique rotation should always be utilized. Unfortunately, orthogonal rotation methods are often used, likely because they are the default in commonly used statistical software (e.g., SPSS). There are several oblique rotation methods available (e.g., quartimin, promax, and direct oblimin), but none has emerged as superior [38]. Therefore, any oblique rotation method should yield similar results.
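For researchers working outside SPSS, the sketch below shows what an ML extraction with an oblique (direct oblimin) rotation might look like in Python, assuming the third-party factor_analyzer package and the hypothetical df/item_columns objects from the earlier sketches; the number of factors passed here is a placeholder pending the extraction decision discussed next.

```python
# pip install factor-analyzer
import pandas as pd
from factor_analyzer import FactorAnalyzer

efa = FactorAnalyzer(n_factors=2, method="ml", rotation="oblimin")  # 2 is a placeholder
efa.fit(df[item_columns])

loadings = pd.DataFrame(efa.loadings_, index=item_columns)
communalities = pd.Series(efa.get_communalities(), index=item_columns)
print(loadings.round(2))        # pattern loadings after oblique rotation
print(communalities.round(2))   # ideally in the 0.4-0.7 range noted earlier
```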
Finally, we need to decide how many factors we should extract based on our EFA. A common approach is to retain all factors that possess eigenvalues > 1. However, this approach is arbitrary and often leads to overfactoring; hence, it is not a useful strategy. Parallel analysis (PA) is a data-driven method that has been identified as the most accurate (although it is still prone to occasional overfactoring). As a result, PA should be used in conjunction with a scree test (plotted eigenvalues) to serve as a “check-and-balance” [39]. While PA is not a built-in function in common statistics programs (e.g., SPSS and SAS), ref. [40] maintains a syntax file that can be downloaded and easily used to run PA. Ultimately, this final step will help you determine whether your new scale possesses one or more factors. Items that do not load onto any factor, or that load onto more than one factor, should be discarded. If this process leaves fewer than 10 items, more items may need to be created and the previous steps carried out again with a new sample.
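For readers who prefer to script PA directly, here is a minimal sketch of one common variant of Horn’s parallel analysis (Python/NumPy), comparing observed correlation-matrix eigenvalues against a chosen percentile of eigenvalues from random normal data of the same dimensions; the function and variable names are our own.

```python
import numpy as np

def parallel_analysis(data, n_iter=100, percentile=95, seed=1):
    """Retain factors whose observed eigenvalues exceed the chosen percentile
    of eigenvalues obtained from random data of the same size."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand = np.empty((n_iter, k))
    for i in range(n_iter):
        sim = rng.standard_normal((n, k))
        rand[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    threshold = np.percentile(rand, percentile, axis=0)
    retain = 0
    for observed, cutoff in zip(obs, threshold):
        if observed > cutoff:
            retain += 1
        else:
            break
    return retain, obs, threshold

# n_factors, observed_eigs, cutoffs = parallel_analysis(df[item_columns].to_numpy())
# print(n_factors)   # cross-check against a scree plot of observed_eigs
```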
Once we have selected our factors, it is a good idea to test the reproducibility of our factor structure. This can be achieved using confirmatory factor analysis (CFA). Completed via structural equation modeling, CFA should only be used when there is a strong underlying theory regarding the factor structure of our measure. If we consider our hypothetical attitudes toward same-sex mothers measure, EFA should precede CFA to allow for a robust theory regarding the factor structure we are aiming to confirm. It is important to note that CFA should be conducted with a different sample than was used for our EFA. Since CFA is more advanced than EFA, we recommend consulting other primers that focus exclusively on its various steps and the interpretation of its outcomes (e.g., [41]).
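Purely as an orientation (the primers cited above remain the authoritative guides), a bare-bones CFA specification might look like the following, assuming the third-party semopy package, its lavaan-style model syntax, hypothetical factor/item names, and a second data frame cfa_df collected from a new sample.

```python
# pip install semopy
import semopy

model_desc = """
Warmth =~ item_01 + item_04 + item_09
Competence =~ item_02 + item_05 + item_07
"""

model = semopy.Model(model_desc)
model.fit(cfa_df)                  # cfa_df: pandas DataFrame from the second sample
print(semopy.calc_stats(model))    # fit indices (e.g., CFI, RMSEA)
```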
4.7. Important Considerations about Construct Validity
If a gold standard indicator is available, you can test the concurrent validity of your item pool. However, as noted earlier, there may be instances where a gold standard does not exist. You may be testing a novel construct (i.e., one that has not been measured previously), or existing scales may be flawed in substantive ways and, consequently, unable to serve as a “gold standard”. It is important to note that, regardless of whether a gold standard exists, researchers still need to conduct tests of construct validity. Stated simply, you must demonstrate that scores on your proposed scale correlate, for theoretical and/or empirical reasons, with scores on other scales.
Before we provide specific examples of how to test for construct validity, a few points need to be considered. First, the scales you select to test for construct validity must be psychometrically sound. Creating makeshift indices to correlate with your item pool does not provide compelling evidence of construct validity (or, more specifically, convergent validity). Nor is it sensible to treat, as validation indices, measures that have substandard psychometric properties. You must ensure that the validation measures you select are the “best of the best”. It is important not to settle for the first scale you come across that evaluates the variable you are using to test for construct validity. Also, you should not adhere blindly to the choices made by other researchers. For example, the fact that another researcher used the Anti-Fat Attitudes Scale (AFAS) [28], which we critiqued earlier, as a validation measure should not reassure you that the AFAS would be appropriate for your own psychometric work. Exercise due diligence: ensure that all measures used in tests of construct validity were created in accordance with best practice recommendations for scale development.
Following a thorough review of the literature, one should always opt for validation measures that have received multiple assessments of scale score reliability and validity. This review process also increases the likelihood that researchers will use the most up-to-date versions of validation measures. For instance, assume that we have created a scale assessing gay men’s sense of belonging to the LGBTQ2S+ community. As one strand of construct validity or, more specifically, convergent validity, we hypothesize that gay men reporting a greater sense of belonging should also be more open (i.e., “out”) about their sexual identity. To assess “outness”, we select a scale developed by [42], entitled the Outness Inventory (OI). A thorough review of the literature, however, would reveal that [43] subsequently revised the OI (i.e., items were added to measure outness to other members of the LGBTQ2S+ community). Thus, ref. [43]’s revised version would be a more appropriate choice.
Second, construct validity involves furnishing multiple strands of evidence in support of your pool of items. We recommend generating three to five hypotheses per study, with the confirmation of each prediction offering one strand of support for the scale’s construct validity. To provide compelling evidence of construct validity, most of the hypotheses that you test need to be confirmed. In the absence of this type of confirmation, the psychometric integrity of your new measure cannot be established.
To test multiple predictions, the researcher can use multiple samples with sufficient power (e.g., 200+ participants per prediction) or a supersized sample (800+ participants) which can be partitioned into subgroups, with each subgroup being used to test different hypotheses.
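As a closing sketch (Python, with pandas and SciPy assumed available), this is how one such prediction might be tested and how a supersized sample might be partitioned; df and the column names are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical prediction: belonging scores correlate positively with outness scores.
r, p = pearsonr(df["belonging_total"], df["outness_total"])
print(f"r = {r:.2f}, p = {p:.3f}")

# Partition a supersized sample (800+ respondents) into four subgroups,
# one per hypothesis, so each prediction is tested on independent cases.
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
subgroups = [shuffled.iloc[i::4] for i in range(4)]
```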