1. Introduction
The Hip Disability and Osteoarthritis Outcome Score (HOOS) was developed as a region- and disease-specific outcome measure to assess disability pertaining to osteoarthritis (OA) [1]. The HOOS development process relied heavily on two previously developed instruments: (1) the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and (2) the Knee Injury and Osteoarthritis Outcome Score (KOOS). The WOMAC is a disease-specific instrument validated for OA in the lower extremities and for evaluating outcomes after a total hip arthroplasty (THA) [1,2], while the KOOS is a region-specific instrument intended to measure pain, symptoms, activities of daily living (ADLs), sport and recreation function, and knee-related quality of life (QOL) in middle-aged patients with or without knee OA [3]. The HOOS contains the items (n = 24) and proposed constructs (i.e., pain, stiffness, physical function) of the WOMAC [1,4]; it also contains items (n = 11) and proposed constructs (i.e., sport and recreational function, QOL) derived from the KOOS to expand the constructs measured by the WOMAC and to improve scale sensitivity and responsiveness in younger, more athletically active patients undergoing a THA for treatment of OA [1]. Lastly, the authors of the HOOS constructed five additional items: two in the pain construct, two in the symptoms construct, and one in the sport and recreation construct.
In total, the HOOS assesses five dimensions of hip-related health with 40 items: Pain (10 items), Symptoms (5 items), Function limitations-daily living (17 items), Sport and Recreation Function (Sport = 4 items), and Hip Related QOL (QOL = 4 items) [1,4]. The scale uses five-option Likert boxes with three different sets of response options (i.e., never to always, none to extreme, or never to constantly) across the five constructs [1,4]. All items are scored 0 to 4; each dimension is individually scored and then transformed to a 0–100 scale [1,4], with 100 indicating no symptoms and 0 indicating extreme symptoms. The HOOS has been translated into 26 languages and is recognized in the United States as an acceptable outcome for measuring functional assessment in patients 21 years of age and older who have been diagnosed with OA [5]. Although the HOOS was developed to assess outcomes in patients with OA, it is also used globally for reimbursement purposes to assess short-term and long-term changes induced by a variety of treatment options including, but not limited to, THA [4]. As such, it is pertinent to assess the measurement properties of the HOOS across diverse patient populations.
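The scoring rule described above (items scored 0 to 4, each dimension transformed to 0–100 with 100 indicating no symptoms) can be illustrated with a minimal sketch. The linear transformation used here, 100 − (mean item score × 100 / 4), and the function name `hoos_dimension_score` are assumptions for illustration; handling of missing responses is not covered.

```python
from statistics import mean

VALID_RESPONSES = {0, 1, 2, 3, 4}

def hoos_dimension_score(item_responses):
    """Return a 0-100 score for one dimension (100 = no symptoms).

    Hypothetical helper: assumes every item in the dimension was answered.
    """
    if not item_responses:
        raise ValueError("at least one item response is required")
    if any(r not in VALID_RESPONSES for r in item_responses):
        raise ValueError("item responses must be integers from 0 to 4")
    # Assumed linear rescaling: mean raw score 0 -> 100, mean raw score 4 -> 0.
    return 100.0 - mean(item_responses) * 100.0 / 4.0

# Hypothetical responses to the four QOL items (not study data).
qol_score = hoos_dimension_score([1, 2, 1, 0])
```

Under this sketch, a respondent answering 0 on every item of a dimension scores 100, and one answering 4 on every item scores 0.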
Early scale validation work has focused on validity (i.e., construct), reliability, and instrument responsiveness [
1,
6,
7,
8]. Construct validity of the HOOS was assessed by correlating the constructs to the SF-36, where moderate correlations (
r = 0.49–0.66) were identified between constructs measuring physical health (i.e., function and pain); however, weaker correlations were identified for assessment of mental health (
r = 0.26) [
1]. Additionally, Cronbach’s alpha values for the constructs (e.g., pain, symptoms) of the HOOS ranged from 0.75 to 0.98 across multiple studies [
6,
7,
8]. Values ranging from 0.70–0.89 have been generally recommended for each construct within an instrument [
9,
10,
11]: (1) exceptionally high values (i.e., >0.90) may be indicative of item redundancy, parallel items, construct underrepresentation, inclusion of too many items, and reduced precision [
11,
12,
13,
14], and (2) low values (i.e., <0.70 in general; ≤0.80 for research tools) may indicate poor internal consistency within the instrument [
9,
10,
12,
13,
14].
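As an illustration of how the internal consistency values discussed above are obtained, the following sketch computes Cronbach's alpha from a respondents-by-items matrix and labels the result against the ranges cited in this section. The function names, the example data, and the treatment of a value of exactly 0.90 as within range are assumptions for illustration only.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows = respondents, columns = items of one construct."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(n_items)]
    # Sample variance of the summed construct score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

def interpret_alpha(alpha):
    """Label alpha using the ranges cited in the text (boundary handling assumed)."""
    if alpha > 0.90:
        return "possible item redundancy"
    if alpha >= 0.70:
        return "recommended range"
    return "low internal consistency"

# Hypothetical responses in which the three items are exact duplicates.
duplicated_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha_duplicated = cronbach_alpha(duplicated_items)
```

Because the three hypothetical items carry identical information, alpha reaches its maximum, which is the pattern that the redundancy concern for very high values refers to.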
Test-retest reliability of the HOOS has also been reported, with values ranging from good to excellent (ICC = 0.75 to 0.97) [
6,
7,
8]. Responsiveness (i.e., the validity of the HOOS over time) has also been established; the HOOS was significantly more responsive in the pain and symptoms constructs (SRM: 2.11, 1.83) compared to the pain and stiffness constructs of the WOMAC (SRM: 1.83, 1.28) [
1]. Lastly, patients younger than 66 years of age reported a higher responsiveness in all five constructs of the HOOS compared to patients over the age of 66 [
1].
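For readers unfamiliar with the standardized response mean (SRM) used above to express responsiveness, a minimal sketch follows. It assumes the usual definition (mean change divided by the standard deviation of change); the function name and the scores are hypothetical and are not data from the cited studies.

```python
from statistics import mean, stdev

def standardized_response_mean(pre_scores, post_scores):
    """SRM = mean of paired change scores / sample SD of the change scores."""
    if len(pre_scores) != len(post_scores):
        raise ValueError("pre and post scores must be paired")
    changes = [post - pre for pre, post in zip(pre_scores, post_scores)]
    sd_change = stdev(changes)  # requires at least two paired observations
    if sd_change == 0:
        raise ValueError("change scores have zero variance")
    return mean(changes) / sd_change

# Hypothetical preoperative and postoperative scores for four patients.
pre = [40, 50, 45, 60]
post = [80, 85, 90, 95]
srm = standardized_response_mean(pre, post)
```

A larger SRM reflects a larger average change relative to the variability of that change, which is why it is used to compare the responsiveness of constructs across instruments.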
Few researchers have examined the psychometric properties of the HOOS using exploratory factor analysis (EFA) or confirmatory factor analysis (CFA) and invariance procedures to verify the underlying factor structure and ensure measurement invariance as is recommended in scale development [
8,
9,
10,
11,
12]. Few CFA studies have been published on the individual constructs proposed in the original HOOS [
15]. Previous authors performed CFAs on individual constructs and model fit recommendations were met for pain (i.e., CFI = 0.99, TLI = 0.98) and function (i.e., CFI = 0.97, TLI = 0.97); however, other recommended fit indices were not met (i.e., RMSEA = 0.14–0.19) for these constructs [
15]. The full scale (i.e., all constructs) should also be assessed using factor analysis procedures, as recommended for best practices [
9,
10,
12,
13]. Researchers have reported that the full scale structure did not meet contemporary fit recommendations (CFI = 0.85; TLI = 0.84; IFI = 0.85; RMSEA = 0.10) in a sample of primarily self-reported healthy participants [
16]. Further, correlations found between first-order latent constructs (e.g., pain, function) were high (ranging from
r = 0.80–0.96); modification indices revealed several meaningful cross-loadings were present (e.g., putting on socks/stockings, taking off socks/stockings) and assessment of error-term correlations revealed that most of the items shared commonalities [
16]. Overall findings of this study do not support the factorial validity of the original HOOS structure and suggest the presence of multicollinearity (i.e., overlapping items or items that are perceived to ask similar questions) and reduced measurement precision [
10,
11,
12].
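The fit indices reported above can be related to the underlying chi-square statistics with a short sketch. The formulas are the commonly used definitions of RMSEA, CFI, and TLI (some software uses N rather than N − 1 in the RMSEA denominator); the cutoffs in `meets_cutoffs` (CFI and TLI ≥ 0.95, RMSEA ≤ 0.06) are an assumption for illustration, since this section does not state the exact thresholds applied, and the chi-square inputs are hypothetical.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation for a model fit to n respondents."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index relative to the baseline (independence) model."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

def meets_cutoffs(cfi_value, tli_value, rmsea_value):
    """Compare indices with assumed cutoffs (CFI/TLI >= 0.95, RMSEA <= 0.06)."""
    return {
        "CFI": cfi_value >= 0.95,
        "TLI": tli_value >= 0.95,
        "RMSEA": rmsea_value <= 0.06,
    }

# Full-scale indices as reported in the text for the mostly healthy sample.
full_scale_check = meets_cutoffs(cfi_value=0.85, tli_value=0.84, rmsea_value=0.10)
```

Applied to the full-scale values quoted above, all three checks fail under the assumed cutoffs, whereas the single-construct pain model (CFI = 0.99, TLI = 0.98, RMSEA ≥ 0.14) would pass on CFI and TLI but not on RMSEA.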
The reported poor psychometric properties of the HOOS were not surprising given that the scale is predominantly derived from the WOMAC, which has also been reported to have questionable psychometric properties. For example, poor fit indices values and error-term correlation findings on the HOOS [
16] are similar to those found when examining the scale structure of the WOMAC [
17,
18]. Authors identified that the pain construct was not supported as a single factor with uncorrelated error-terms (CFI = 0.90; TLI = 0.80; RMSEA = 0.21), and the modification indices revealed significant error correlations between “at night while in bed” and “sitting or lying” and “walking on flat ground” and “going up or down stairs” [
18]. Researchers examining the scale structure of the WOMAC performed a CFA on 11 of the 24 items (i.e., 3 pain items, 8 function items) and reported moderate overall fit in two samples (CFI = 0.95–0.97; RMSEA = 0.07–0.08) [
17]. However, error-term correlations were specified between the items in this model, which included pain item 1 (i.e., walking on flat surface) and function item 6 (i.e., walking on flat surface), pain item 2 (i.e., up/down stairs) and function item 2 (i.e., ascending stairs), and function item 7 (i.e., getting in/out of a car) and function item 9 (i.e., putting on socks) [
17]. The addition of error-term correlations between items limits the conclusions that can be drawn from the scale; as such, previous methods of scoring may not be sufficient as the items correlated cannot be scored separately [
19]. Therefore, interpreting scores from the instrument becomes difficult, and specifying such error-term correlations is not recommended as best practice [
12,
19].
While the previous findings call into question the factorial validity of the HOOS and WOMAC, there were limitations noted in the previous studies worth considering. First, the CFAs were generally performed on individual constructs instead of examining the full factor structure as is recommended [
11,
12,
20,
21]. Second, one study included a CFA on the full model that utilized a moderate sample (n = 655) of mostly healthy respondents [
16]. Further, invariance testing (e.g., multi-group, longitudinal) results have not been reported in the target population (i.e., THA patients); this testing is an important process that ensures the interpretations between groups (e.g., male vs. females) or across time (e.g., preoperative and postoperative visits) are valid and reliable [
9,
22].
Despite the current use of the HOOS (e.g., approved outcome measure for reimbursement purposes following THA), complete and robust psychometric analysis of the scale in a large dataset of patients for which the scale is designed has not been performed. As such, there is a need to conduct a CFA on the full scale to test the hypothesized factor structure of the 40-item HOOS to ensure that the items are appropriate indirect measures of the hypothesized latent constructs in a large, targeted sample of patients seeking care for the hip (e.g., THA) [
21,
22]. If the scale structure fails to meet recommended levels during the CFA procedures, alternate model generation should be conducted, following best practice recommendations [
23], on the given items to determine if a parsimonious scale structure can be identified prior to further testing [
12,
21,
22]. Further, there is need for invariance testing to ensure the scale is unbiased towards different groups of interest [
12,
21,
22]. Lastly, it is important to understand how different groups respond to the outcomes over time postoperatively.
Therefore, the primary purpose of this study was to assess the psychometric properties of the HOOS in a large, diverse sample of patients who underwent a THA. Because model fit recommendations for the instrument were not met, alternate model generation procedures were performed to identify a more parsimonious model. The secondary purpose of this study was to conduct invariance testing between age groups and sex (i.e., multi-group), and longitudinal invariance (i.e., across time points), and to perform latent growth-curve (LGC) modeling on the parsimonious scale structure identified.
4. Discussion
With the occurrence of THAs expected to significantly increase by 2050 [
34], it is imperative that clinicians have access to PROs that can be widely used across different sexes (i.e., males and females), age groups (i.e., 18–94), and repeated visits. Having a PRO to assess the patient’s perspective of hip health throughout their recovery is beneficial to clinicians to ensure positive outcomes following arthroplasty. More recently, significant attention has been focused on PROs associated with outcomes following THA [
35,
36]. Therefore, the need to establish a psychometrically sound tool that adequately measures the multifaceted nature of hip pain and function is valuable. Previous psychometric analysis on the HOOS has yet to yield a scale structure that meets recommended model fit indices [
16], and assessment of how age and sex influence patient responses to the individual items and mean scores has not been conducted [
1,
16,
37]. As such, the primary purpose of this study was to assess the psychometric properties of the original 40-item HOOS in patients undergoing THA. The CFA of the 40-item HOOS did not meet recommended model fit indices. Therefore, an EFA was conducted to establish a more parsimonious scale structure. The alternate three-factor, nine-item model (HOOS-9) was then subjected to multi-group invariance testing (i.e., sex and age groups), longitudinal invariance testing across five time points (i.e., preoperatively and 6 months, 1 year, 2 years, and 3 years postoperatively), and LGC modeling across the same five time points. The alternate HOOS-9 met recommended measurement criteria and can be recommended for use in research and clinical practice.
4.1. Confirmatory Factor Analysis
The original five-factor, forty-item scale structure of the HOOS was not supported in our study, as demonstrated by the poor model fit indices and the high latent variable correlations. However, our findings are consistent with previous research where a well-supported scale structure in mostly healthy adults was not found [
16]. High to very high correlations (
r = 0.77–0.91) between the first-order latent variables were found, indicating multicollinearity between factors. Modification indices also demonstrated that there were items with meaningful cross-loadings, indicating that overlapping items were present (e.g., item 6 [how often is your hip painful] and item 37 [how often are you aware of your hip problem]), and high error-term correlations between several items (e.g., item 24 [putting on socks/stockings] and item 26 [taking off socks/stockings]). These findings further suggest the presence of multicollinearity. Poor model fit and the presence of multicollinearity provide evidence that the current 40-item scale should not be used [
12,
21,
38]. Thus, to determine if a psychometrically sound version could be identified using the original items, alternate model generation was warranted [
12,
21,
38].
4.2. Psychometric Analysis of the Alternate HOOS-9
An EFA was conducted using a calibration sample (i.e., sample n1), which yielded an alternate three-factor, nine-item solution (i.e., HOOS-9). The nine items represented three of the original five constructs of the HOOS: three items from “Function, daily living” (i.e., original HOOS items 17, 19, 21), three items from “Function, sports and recreational activities” (i.e., original HOOS items 33, 34, 35), and three items from “Quality of Life” (i.e., original HOOS items 38, 39, 40). The alternate HOOS-9 underwent covariance modeling procedures using the validation sample (i.e., sample n2). Although the alternate HOOS-9 retained only 22.50% of the questions from the original scale (9 of 40 items), participant responses were highly correlated (r = 0.93) with the original 40-item HOOS and accounted for a substantial amount of the variance (R² = 0.87).
The three-factor structure identified in our sample is different than other HOOS short-forms, including the HOOS-JR (i.e., original HOOS items 18, 15, 18, 20, 27, 29), HOOS-PS (i.e., original HOOS items 29, 16, 28, 34, 35), and more specifically the three-factor, twelve-item HOOS (HOOS-12) [
15,
39]. The HOOS-12 is a short-form version of the original 40-item HOOS that includes three factors (i.e., Pain, Function daily living, and QOL) consisting of twelve items (i.e., original HOOS items 6, 9, 10, 12, 18, 19, 22, 36–40); our alternate HOOS-9 model shares only four of these items (i.e., original HOOS items 19, 38, 39, 40), in the ADLs and QOL constructs. When developing the HOOS-12, the authors used computerized adaptive test (CAT) simulations to identify the items that best match patients’ level of pain and function [
39]. Limitations exist with the use of CAT, such as high cost and the adaptation of the questionnaire to the individual person’s responses [
40]. Therefore, patients may not be answering the same questions based on their responses to the bank of items. This methodology poses further limitations on the ability of clinicians attempting to draw conclusions of the PROs; as such, CAT may not always be appropriate in the clinical setting [
40]. Additional assessment of structural validity on the HOOS-12 was conducted by performing CFAs on the individual constructs (i.e., pain, function, QOL) and not on the full scale [
15]. The best-practice recommendation when performing CFA is to assess the entire scale and, if the model meets recommended fit indices, to perform invariance testing (e.g., multi-group, longitudinal) to ensure the instrument can be used across several groups and time points [
12,
20,
21].
In addition to these findings, previous research assessing structural validity using CFA on the full HOOS-12 did not support its use in a sample of mostly healthy adults [
16]. Several concerns related to the scale were noted: high correlations between the constructs (i.e., indicating potential multicollinearity), high Cronbach’s alpha values (i.e., indicating potential item redundancy), and cross-loadings of items (i.e., items shared commonalities) [
16,
21]. Therefore, further testing of the HOOS-12 was not warranted in the population studied [
16]. In our identified model, correlations between constructs ranged from 0.38–0.56, and modification indices did not reveal cross-loadings between items. Therefore, our findings present a newly refined short-form version of original HOOS items that measures unique constructs.
4.3. Multi-Group and Longitudinal Invariance Testing of the Alternate HOOS-9
We assessed group differences using CFA methods between groups of interest (i.e., age groups and sex) and across several time points for the alternate HOOS-9. Invariance testing supports the structural validity of the scale by ensuring that the constructs (i.e., ADLs, Sport, and QOL) are measured, and their items interpreted, similarly across groups (i.e., males, females) and time (i.e., multiple visits) [
12,
20,
21]. Thus, an invariant instrument allows clinicians to compare scores across groups or visits and provides support that score differences in hip health are true group differences as opposed to measurement errors [
12,
21]. Few studies have assessed multi-group and longitudinal invariance using any version of the HOOS; previous work has focused on invariance testing of multiple short-forms (i.e., the HOOS-JR and HOOS-PS) [
16]. In a previous study, however, differences between hip pathology and physical activity groups in the HOOS-JR and HOOS-PS were assessed [
16]. In a more recent study, multi-group (i.e., age groups and sex) and longitudinal invariance (i.e., multiple visits) in a similar sample of patients who underwent a THA was also conducted. To our knowledge, this was only the second study to assess multi-group and longitudinal invariance in a short-form version (i.e., HOOS-9) using items from the original 40-item HOOS.
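The stepwise logic of invariance testing described above can be sketched as a comparison of increasingly constrained nested models. The sketch assumes the common sequence of configural, metric, and scalar models and a change-in-CFI heuristic (a drop of no more than 0.01 between adjacent models); the criteria actually applied in this study are not restated in this section, and the function name and CFI values below are hypothetical.

```python
INVARIANCE_LEVELS = ("configural", "metric", "scalar")

def highest_supported_level(cfi_by_level, max_cfi_drop=0.01):
    """Return the most constrained level whose CFI drop stays within the cutoff.

    Assumes the configural model itself already fits acceptably.
    """
    supported = INVARIANCE_LEVELS[0]
    for previous, current in zip(INVARIANCE_LEVELS, INVARIANCE_LEVELS[1:]):
        drop = cfi_by_level[previous] - cfi_by_level[current]
        if drop > max_cfi_drop:
            break  # adding these equality constraints worsened fit too much
        supported = current
    return supported

# Hypothetical CFI values for a two-group (e.g., male vs. female) comparison.
example_cfi = {"configural": 0.970, "metric": 0.968, "scalar": 0.965}
supported_level = highest_supported_level(example_cfi)
```

Only when the most constrained level is supported are comparisons of latent means across groups or visits generally considered interpretable, which is the rationale given above for testing invariance before comparing scores.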
We found the alternate HOOS-9 was invariant at the preoperative visit (i.e., preoperative THA) between age groups (i.e., <45, 45–54, 55–64, 65–74, ≥75) and sex (i.e., males, females). These results indicate that the new alternate scale can be used to assess differences in hip-related dysfunction in patients undergoing THA. In addition to our invariant findings, significant latent variances and latent mean differences were not found between age groups or sex, suggesting that minimal differences in hip disability were perceived between groups. These findings are different from those found in previous research, where latent mean differences in sex groups were found with females reporting higher mean scores on the HOOS-JR compared to males. In addition, other researchers identified sex and age differences, with females and those in older age groups reporting higher scores on the 40-item HOOS and HOOS-12 for all domains [
41,
42]. However, Sunden et al. only identified significant differences between males and females in the oldest age group (i.e., 75–84); no significant differences in mean scores were found between males and females in the other age groups (i.e., 18–35, 36–54, 55–74) [
41]. Larsen et al. found significantly worse HOOS and HOOS-12 scores with increasing age. Within our population, the majority of our sample was younger than 75 years of age, which could partially explain these findings [
42]. Of important note, these findings are associated with different versions of the HOOS scale, which include different items. Having different items compared to the other versions indicates that the scales are not necessarily measuring hip disability in the exact same way. Therefore, our findings are unique in that the scale structure of the alternate HOOS-9 demonstrates no significant differences between sex and age groups.
This study also provides evidence of scale validity of the alternate HOOS-9 for assessing postoperative effects across time. Longitudinal invariance was established across multiple visits (i.e., preoperatively and 6-months, 1-year, 2-years, and 3-years postoperatively), indicating that the scale can be used to assess differences in hip disability across time. Thus, the results supported the assessment of mean scores across time to determine if scores changed post-THA. We identified significant latent mean differences across time points, indicating that patients reported a meaningful improvement in scores preoperatively to 3-years postoperatively. In addition, the highest scores (i.e., more hip disability) were reported preoperatively and the lowest scores (i.e., less hip disability) were identified at 3-years postoperatively. These findings provide support for scale validity as patients who receive surgery would be expected to report improvement over time following the intervention (i.e., THA) as natural healing occurs across visits. These findings are congruent with previous research reporting significant improvement in scores on the HOOS and HOOS-12 in patients who underwent THA from preoperatively to 2-years postoperatively [
43,
44].
4.4. Alternate HOOS-9 Latent Growth-Curve Modeling
To our knowledge, this was the first study to perform LGC modeling in patients who answered questions of the 40-item HOOS over a 3-year period postoperatively. Use of LGC modeling is a robust technique that allows researchers to assess between-person differences and within-person change, which is unique compared to traditional longitudinal assessments (e.g., repeated-measures analyses or multivariate analyses) [
20,
45]. In addition, LGC modeling is highly flexible when attempting to assess differences in unequally spaced time points (e.g., months, years) and for more complex nonlinear data [
20,
45]. The few studies we identified assessed outcomes related to hip disability (i.e., HOOS Physical Function [HOOS-PS], Oxford Hip Score [OHS]) over a 12-month period postoperatively in patients who underwent THA [
46,
47]. In addition, other researchers assessed LGC of the OHS in patients over 6-weeks postoperatively [
48]. These three studies all identified a nonlinear improvement, with most improvement occurring within the first 6-weeks to 3-months postoperatively [
46,
47,
48]. These findings are similar to ours; the lack of fit within the linear model, along with the re-defined nonlinear model, demonstrates the majority of the growth and improvement in scores occurred within the first 6-months postoperatively.
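For readers less familiar with LGC models, the contrast between a linear and a nonlinear specification can be written in a generic form. The notation below is an assumption for illustration (a latent basis model with an intercept factor and a shape factor) and is not taken from the model syntax used in this study.

```latex
% Observed score of patient i at time point t (t = 1,...,5):
y_{it} = \eta_{0i} + \lambda_t \, \eta_{1i} + \varepsilon_{it},
\qquad
\eta_{0i} = \mu_0 + \zeta_{0i},
\qquad
\eta_{1i} = \mu_1 + \zeta_{1i}

% Linear growth: loadings fixed to the spacing of the visits in years
% (preoperative, 6 months, 1 year, 2 years, 3 years):
(\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5) = (0,\ 0.5,\ 1,\ 2,\ 3)

% Nonlinear (latent basis) growth: first and last loadings fixed,
% intermediate loadings estimated from the data:
\lambda_1 = 0, \qquad \lambda_5 = 1, \qquad \lambda_2, \lambda_3, \lambda_4 \ \text{free}
```

In this form, \mu_0 and \mu_1 describe the average baseline score and average total change, the variances of \zeta_{0i} and \zeta_{1i} describe between-person differences in baseline and change, and an estimated \lambda_2 close to 1 would indicate that most of the total change has occurred by the 6-month visit.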
Researchers defined groups by healing trajectories (e.g., fast starters, early recovery) [
47,
48], PROs (i.e., OHS, HOOS-PS), or how the patients scored (i.e., high-high, intermediate, low-high) [
46]. These defined groups differ from our study, which examines the differences age groups and sex have on responses to the alternate HOOS-9 over time. Our results indicate that patients in the male group have an overall higher score at baseline (40.63) compared to those in the female group (34.56). In addition, patients in both male and female groups who scored lower at baseline had an overall faster growth over time, although those in the female group had a slower rate of growth over time in comparison to males (−52.41 vs. −67.61), respectively. These findings are similar to Hesseling et al., who reported females were considered slow starters, meaning they had slower improvement in hip function and QOL within the first 3-months postoperatively but had a significant overall improvement at 1-year postoperatively [
47]. In addition, we found females had a higher mean score at 3-years postoperatively (92.86) in comparison to males (81.75). However, even though differences were identified between patients in the male and female groups, the variances of the model for the intercept and shape were not statistically significant. This finding indicates that there were no significant differences between the two groups (i.e., interindividual differences).
To our knowledge, this was also the first study to assess different age groups across time points. When assessing these differences, patients in the age group 45–54 scored the lowest overall at baseline (33.42) when compared to the other groups, and patients in the age group ≥75 had the highest overall mean score at baseline (41.12). In addition, patients in the age group <45 who had lower self-perceived hip function and QOL reported greater improvements in their scores (−461.30) compared with those aged 45–54 (−103.42) and 65–74 (−342.90); however, they had a slower rate of increase in scores over time. Patients in the ≥75 group had a steeper growth and improvement in HOOS-9 scores (451.13) over time, though they had an overall lower mean score at time 5 (82.67) in comparison to those in the age group 55–64 (133.96). These findings indicate that patients in the age group ≥75 improve their hip disability and QOL faster but have an overall lower score on the alternate HOOS-9 compared to the other age groups. Variances between the intercept and slope, however, were not statistically significant (p > 0.05), which indicates that interindividual differences are homogeneous rather than heterogeneous.
4.5. Limitations and Future Research
Though this study contained a large sample of patients undergoing a THA, limitations are present that should be addressed. Though a cross-validation sample was used to assess the alternate HOOS-9, participants responded to the original 40-item HOOS. As such, the responses to items on the alternate HOOS-9 could be influenced by the other 31 items [
38]. Therefore, future research should assess the scale structure on patients who only answer the nine items [
38]. We assessed concurrent validity (i.e., correlation between two scales) between the original constructs of the 40-item HOOS and the newly proposed scale. Future researchers may want to consider conducting further analyses that correlate the HOOS-9 responses with other scales designed to measure similar dimensions (e.g., QOL). As this is the first study to report the HOOS-9, limitations may exist when attempting to assess differences in clinical practice and research. Therefore, future research should be conducted to determine the responsiveness, minimal clinically important difference, and reliability of the instrument.
Additionally, even though the HOOS-9 was invariant between groups of interest (i.e., sex and age groups), our dataset was limited and did not include other pertinent information, such as demographic data (e.g., race, ethnicity, medical history), diagnosis (e.g., osteoarthritis, hip dysplasia), surgical procedure (i.e., primary, revision), or operative data (e.g., surgical approach, laterality, implant type). Thus, caution is warranted when examining alternate HOOS-9 differences in groups that have not yet been analyzed. Future research should focus on invariance testing across several different groups (e.g., diagnosis, surgical approach, surgical procedure) to ensure the scale has the necessary properties to support between-group analyses in these populations. It may also be pertinent to develop a database that allows outcomes and patient information to be collected longitudinally. This would allow researchers to track outcomes associated with the surgical procedure and potentially identify populations in need of further medical care (e.g., infections, revisions) to determine if further scale refinement or creation of a new scale is needed in such populations. In addition, further analyses using LGC modeling with these different groups could help clinicians and researchers understand healing differences over time.
Another limitation of this study was the decision to score the alternate HOOS-9 as a total score versus scoring each construct individually for purposes of LGC modeling. Scoring PROs as a total score is common practice for documentation purposes because it allows changes over time to be assessed easily. Our model had low to moderate correlations between the first-order latent variables, thus providing justification that the items are measuring unique constructs. However, we fit a bifactor model to determine if a composite score could be used even though the constructs were unique. Our findings reveal acceptable goodness-of-fit indices, indicating that clinicians may be able to score the alternate HOOS-9 as a total summed score. Therefore, future research should be conducted to assess the reliability and validity (e.g., responsiveness) of the alternate HOOS-9 using the total summed scores. Finally, although we had an overall large sample for this study, the sample size was much smaller (n = 1140) when assessing invariance and differences over time (i.e., longitudinal invariance and LGC modeling) due to the low percentage (17.40%) of patients who answered the items at all time points. Therefore, future research should be conducted in a larger sample to ensure similar findings exist.