1. Introduction
Explaining causal relationships is paramount for understanding how treatments influence outcomes in domains such as medicine, economics, and the social sciences. Over the past few decades, envisioning the potential outcomes of each unit in a population under different treatment conditions has become a staple of causal inference. Various causal measures of interest can be defined using the potential-outcome means [1,2,3]; consequently, once the potential-outcome means are identified, a series of causal measures can be identified as well.
Randomized experiments are viewed as the gold standard for causal inference. The means of the subsamples corresponding to different treatment levels serve as consistent estimators of the corresponding potential-outcome means. However, in practice, randomized assignment of the treatment is often difficult to implement due to ethical concerns, high costs, and other constraints. Observational studies are in general more feasible. Under the assumption that the collected data are precisely measured, researchers can use various methods to adjust for confounding and obtain consistent estimates of the potential-outcome means in these studies, including regression [4], matching [5], and inverse probability weighting [6], among others. Nevertheless, in many observational studies, researchers face challenges in obtaining accurate data. For instance, in classic studies on the impact of smoking on lung cancer, smokers may misreport their smoking status; when economists study the determinants of income, respondents may misreport their earnings; in the job market, applicants may falsify their educational qualifications. VanderWeele and Li [7] pointed out that measurement error is one of the key threats to statistical analysis in observational studies. Measurement errors cause the observed data distribution to deviate from the true distribution, which can lead to biased or even entirely incorrect conclusions in causal inference. For a long time, many researchers have focused on the estimation of model parameters from data with measurement error or misclassification (i.e., discrete variables measured with error). The well-known book by Carroll et al. [8] covered many measurement error models, emphasizing bias-correction techniques. Schennach [9] reviewed significant progress in estimation and inference with mismeasured data, especially approaches that rely on validation data or auxiliary variables, e.g., repeated measurements, multiple indicators, measurement systems with a factor model structure, instrumental variables, and panel data. Although the statistical analysis of measurement error data has a long history, the topic has remained active and has seen renewed attention with the advent of the big data era and the introduction of new experimental methods. For example, Amorim et al. [10], Tao et al. [11], and Amorim et al. [12] addressed measurement error challenges in multi-phase studies, which include data collected from one or more rounds of validation processes.
Recently, many researchers in the field of causal inference have also paid considerable attention to measurement error. Boatman et al. [13] examined the estimation of causal effects through a weighted strategy in randomized clinical trials, addressing noncompliance measured with error. Gravel and Platt [14] proposed a method for estimating the marginal causal odds ratio when outcomes are misclassified, utilizing internal validation information. Yanagi [15] provided identification and estimation results for local average treatment effects with a misclassified treatment by leveraging an exogenous variable. Shu and Yi [16,17,18] investigated inverse probability weighting estimation of causal effects, integrating validation data sets when the collected data were prone to errors. Edwards et al. [19] introduced a reparameterized imputation method for addressing measurement error, applicable to estimating counterfactual risk functions or hazard ratios with internal or external validation data. Richardson et al. [20] employed a reference population to achieve identifiability of causal effects in the presence of measurement error in continuous treatments. Several other studies have addressed measurement error issues in the context of instrumental variable methods or mediation analysis; notable examples include VanderWeele et al. [21], Jiang and Ding [22], and Cheng et al. [23], among others. Currently, the literature in this field primarily focuses on the misclassification of causal variables, such as treatment variables or mediators, with relatively less attention given to the misclassification of outcome variables. Moreover, regarding identifiability, while many existing studies address the identifiability of various causal measures, they often do not directly consider the identifiability of the potential-outcome means; identifying these means enables the identification of a range of causal measures, such as the risk difference, the risk ratio, and the odds ratio. Regarding estimation, current approaches to similar problems frequently depend on specific parametric models, so the consistency of the resulting estimators hinges on the correct specification of those models, and the estimators are not necessarily efficient. The development of semiparametric efficiency theory and the construction of multiply-robust estimators in the context of causal inference with measurement error remain incomplete.
Our article contributes to the prior literature as follows. First, we directly provide identification results for the potential-outcome means using a special combined sample, composed of a primary sample and a validation sample, in which no individual has complete data. By identifying these means, we provide identification results for various causal measures. Second, based on the semiparametric theory framework [24], we derive the efficient influence functions for the potential-outcome means under the observed data law. To our knowledge, no similar results have been reported in the context of causal inference with misclassification. Third, building on the efficient influence functions, we develop efficient and multiply-robust estimators for the means. Beyond the target estimands and the multiply-robust estimation methods used, our setup and assumptions differ from those in the existing literature on causal inference involving misclassified outcome variables. We adopt a more relaxed assumption about the misclassification mechanism than [16,18]: in addition to the true outcome variable, we allow the misclassification probability to depend on the covariates. Under our setup, the primary sample contains the collected covariates, the treatment variable of interest, and the misclassified outcome variable, but lacks information on the true outcome variable. The validation sample, on the other hand, includes the covariates, the misclassified outcome variable, and the information needed to determine the true outcome values. This setup aligns well with practical applications and differs from previous work [14,16,18], because the validation sample may come from a validation study whose target treatment variable differs from that of the primary study and therefore lacks information on the treatment variable of interest. This scenario is typical in the data fusion literature, where the variables collected in each study may differ [25]. As a result, there are no complete data for any subject in our setting.
The structure of the paper is as follows. In Section 2, we outline the formal setup, the assumptions necessary for our error-prone outcome context, and the nonparametric identification results for the potential-outcome means. Section 3 discusses the efficient influence functions for the means and proposes the multiply-robust estimation approach; the asymptotic properties of the proposed method are also provided. Section 4 and Section 5 demonstrate the finite-sample performance of the proposed method through simulation studies and real data analysis, respectively. The discussion is presented in Section 6. Technical proofs are provided in Appendix A.
2. Setup, Assumptions, and Nonparametric Identification
Assume we have obtained two samples from two studies: one of main interest called the primary study, and another called the validation study. In the primary study, the outcome variable cannot be measured accurately, and the collected data consist of (X, T, W), where X represents the covariates, T is the binary treatment of interest, and W is the misclassified version of the true binary outcome variable Y. When T_i = 1, individual i is assigned to the treatment group; otherwise, the individual is assigned to the control group. In the primary sample, we assume no information can be used to determine the true value of Y_i. In the validation study, we have information that helps ascertain the true outcome values, e.g., carbon monoxide levels can help determine smoking cessation. That is, we actually obtain the data for both Y and W in the validation study. However, the study may not include the treatment T, because it may have been designed to evaluate another treatment influencing the same outcome Y as the primary study. The collected validation data thus consist of (X, Y, W). Using either sample alone makes it difficult to identify the potential-outcome means, but combining the two samples makes the identifiability and estimation of the means and a series of causal measures possible. The problem considered in our data analysis aligns with this setup: one data set includes the treatment of interest but only contains the misclassified outcome variable, while the other validation data set provides both the accurate and misclassified outcomes but does not collect the treatment of interest. Let R be an indicator, where R_i = 1 indicates that individual i belongs to the primary sample, and R_i = 0 indicates that individual i belongs to the validation sample. Then, the combined sample can be expressed as {O_i : i = 1, …, n}, where O_i denotes the observed data for individual i and n is the total sample size. In the following, we omit the subscript i wherever this causes no confusion.
Under the potential-outcome framework [26], let Y(1) denote the potential outcome if an individual were assigned to T = 1 and Y(0) the potential outcome if the individual were assigned to T = 0. Only one of Y(1) and Y(0) can be observed for an individual, and we assume Y = TY(1) + (1 − T)Y(0) (called the consistency assumption) throughout the paper. In this paper, we aim to identify and estimate the potential-outcome means μ1 = E{Y(1)} and μ0 = E{Y(0)}.
Many causal measures are defined using the potential-outcome means, such as the risk difference (RD), μ1 − μ0; the risk ratio (RR), μ1/μ0; or the odds ratio, {μ1(1 − μ0)}/{μ0(1 − μ1)}. Some existing studies have explained and compared the definitions and applicability of these causal measures [2,3,27,28,29]. In addition, when the treatment does not cause harm, that is, Y(1) ≥ Y(0), referred to as the monotonicity assumption [30], the joint distribution of the potential outcomes can be written in terms of the means: since monotonicity rules out the event {Y(1) = 0, Y(0) = 1}, we have P{Y(1) = 0, Y(0) = 0} = 1 − μ1, P{Y(1) = 1, Y(0) = 1} = μ0, and P{Y(1) = 1, Y(0) = 0} = μ1 − μ0. Assume that Y = 1 indicates a positive or beneficial result; then, the first joint probability represents the rate of “never benefit” (NBR), the second represents the rate of “always benefit” (ABR), and the third represents the treatment benefit rate (TBR) [31]. On the other hand, the joint distribution of the potential outcomes plays an important role in causal attribution [32,33]. It is evident from the definitions that if the potential-outcome means are identifiable, all of the above quantities are identifiable. To proceed, we list the following assumptions:
Assumption 1. {Y(1), Y(0)} ⫫ T ∣ X; that is, the potential outcomes are independent of the treatment conditional on the observed covariates X.
Assumption 2. P(W = 1 ∣ Y = 1, X) ≠ P(W = 1 ∣ Y = 0, X).
Assumption 3. W ⫫ T ∣ (Y, X); that is, the misclassified outcome W is independent of the treatment T conditional on the true outcome Y and the observed covariates X.
Assumption 4. .
Assumption 5. 0 < P(T = 1 ∣ X = x) < 1 for any x ∈ 𝒳, where 𝒳 is the support set of X.
Assumption 1 is referred to as the unconfoundedness assumption [34]; it is considered plausible in situations where the covariates X are rich enough to include all common causes of both the treatment and the outcome variables. Assumption 2 implies that the misclassified outcome W correlates with the true outcome Y conditional on X. Assumption 3 implies that the misclassification probability is allowed to depend on the covariates X; this assumption is weaker than the traditional non-differential misclassification assumption used in Shu and Yi [16]. Assumptions 2 and 3 are plausible when the misclassification error is directly influenced by Y and X but is independent of the treatment assignment; misclassification errors due to self-reporting generally align with Assumptions 2 and 3. Assumption 4 is naturally satisfied in situations where the validation sample can be viewed as a random subsample drawn from the population of the primary study. Assumption 5 is called the overlap assumption in the causal inference literature; it holds when each individual has a non-zero probability of being assigned to each treatment level.
Under Assumptions 1 and 5, when precisely measured data are available, the potential-outcome means can be identified as
μt = E{E(Y ∣ X, T = t)} = E{I(T = t)Y/P(T = t ∣ X)},  t = 0, 1, (1)
and then the regression-based and the weighting-based estimators can be established using the plug-in method based on (1). The consistency of these methods, however, requires the collection of precise outcome data, and naive estimators that directly use the misclassified outcome W in place of Y may be seriously biased for the potential-outcome means.
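To make the plug-in idea and the effect of ignoring misclassification concrete, the following R sketch contrasts the regression-based and weighting-based plug-in estimators of μ1 built from (1) with a naive weighting estimator that substitutes W for Y. The data-generating model, the 10% flip rate, and the logistic working models are illustrative assumptions and are not the settings used elsewhere in this paper.

```r
set.seed(1)
n <- 5000
dat <- data.frame(x = runif(n))                              # a single covariate
dat$trt <- rbinom(n, 1, plogis(-0.5 + dat$x))                # treatment T
dat$y   <- rbinom(n, 1, plogis(-1 + dat$x + 0.8 * dat$trt))  # true binary outcome Y
dat$w   <- ifelse(rbinom(n, 1, 0.9) == 1, dat$y, 1 - dat$y)  # misclassified outcome W (10% flipped)

ps_fit  <- glm(trt ~ x, family = binomial, data = dat)                   # propensity score model
out_fit <- glm(y ~ x, family = binomial, data = dat, subset = trt == 1)  # outcome model among the treated

# Plug-in estimators of mu_1 = E{Y(1)} based on (1), using the accurately measured outcome
mu1_reg <- mean(predict(out_fit, newdata = dat, type = "response"))
mu1_ipw <- with(dat, mean(trt * y / fitted(ps_fit)))

# Naive weighting estimator that ignores misclassification: replaces Y with W
mu1_naive <- with(dat, mean(trt * w / fitted(ps_fit)))
c(regression = mu1_reg, ipw = mu1_ipw, naive = mu1_naive)
```

Because the naive estimator targets the mean of the misclassified outcome rather than μ1, its bias does not vanish as the sample size grows.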
Let f(·) denote the probability density or mass function of a random variable (vector). To tackle the case where the binary outcome Y is subject to misclassification, we define several functions of the observed data distribution for simplicity of exposition.
The following theorem provides the nonparametric identification results for the potential-outcome means under our setup.
Theorem 1. Suppose Assumptions 1–4
hold. Then, μ1 can be identified as in (2) and μ0 can be identified as in (3). Note that the identification results (2) and (3) rely on Assumption 4 and leverage information from both the primary sample and the validation sample. The identification of a series of causal measures follows from the identification of the potential-outcome means. For example, the risk difference can be identified as the difference between (2) and (3), and the risk ratio as their ratio. Similar results can also be obtained for the causal odds ratio and for the joint distribution of the potential outcomes under the monotonicity assumption.
Based on the identification results of Theorem 1, we can first estimate the four functions defined above by positing parametric models and fitting them with common parametric regression methods, and then obtain plug-in estimates of μ1 and μ0. However, the estimators obtained through this strategy may not be efficient and rely heavily on correct specification of the parametric models.
3. Efficient and Multiply-Robust Estimation Based on the Semiparametric Theory
In practice, the specification of parametric models may not always be accurate. Multiply-robust estimators, which remain consistent when one, but not necessarily all, of several sets of working model assumptions is correctly specified, are in general a preferable choice. In the causal inference literature with accurately measured data, a common strategy for constructing multiply-robust estimators is to study the efficient influence function of the parameter to be estimated [35,36], which requires advanced semiparametric theory. From the efficient influence function, efficient estimators can then be constructed, and their multiply-robust properties can be investigated. In this section, we derive the efficient influence functions for μ1 and μ0 within the framework of semiparametric theory in the presence of a misclassified outcome variable. Furthermore, we propose efficient estimators of μ1 and μ0 and explore their multiply-robust properties.
To facilitate understanding, we give a short review of the concepts of asymptotic linearity and influence functions in the semiparametric theory framework [24]. An estimator θ̂ of a p-dimensional parameter θ is referred to as asymptotically linear if there exists a p-dimensional function φ(O) of the collected variable O such that E{φ(O)} = 0 and
θ̂ − θ = n^{-1} Σ_{i=1}^{n} φ(O_i) + o_p(n^{-1/2}),
where φ(O) is called the influence function of θ̂. The influence function with the lowest variance is referred to as the efficient influence function, and an estimator whose influence function equals the efficient influence function is semiparametric efficient.
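For completeness, a standard consequence of asymptotic linearity (general semiparametric theory, not a result specific to our setting) is the following asymptotic normality statement, which is why comparing asymptotically linear estimators reduces to comparing the variances of their influence functions:

```latex
\sqrt{n}\,(\hat{\theta} - \theta)
  = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varphi(O_i) + o_p(1)
  \;\xrightarrow{d}\; N\!\bigl(0,\; E\{\varphi(O)\varphi(O)^{\top}\}\bigr).
```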
We proceed by deriving the efficient influence functions for μt, t = 0, 1, in the next theorem within our setting. The detailed proof of the theorem can be found in Appendix A. We also define some auxiliary functions of the observed data law that appear in the efficient influence functions below.
Theorem 2. Under Assumptions 1–5
, the efficient influence function for μ1 can be expressed as in (4), and the efficient influence function for μ0 can be expressed as in (5). Theorem 2 provides the efficient influence functions (4) and (5) for the two potential-outcome means in the presence of a misclassified outcome variable. According to the semiparametric theory framework, the semiparametric efficiency bound for μ1 is the variance of (4), and that for μ0 is the variance of (5). It can be seen that the efficient influence function of each mean is composed of five distinct components of the observed data law. We assume corresponding parametric working models for these five components and denote the resulting estimators of the nuisance parameters, which are consistent when their corresponding models are correctly specified; plugging these estimators in then yields estimates of the components that enter the efficient influence functions.
Constructing estimators from the efficient influence functions is a common approach in semiparametric efficient estimation. Let ℙn denote the empirical mean operator for the sample size n, which means ℙn{g(O)} = n^{-1} Σ_{i=1}^{n} g(O_i) for any function g of the observed data O. Theorem 2 and the definition of the influence function imply that we can derive estimators for μ1 and μ0 by setting the empirical means of the estimated efficient influence functions (4) and (5) to zero and solving for the means.
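To illustrate the mechanics of solving the empirical-mean equation of an influence function, the R sketch below uses the accurate-data case, in which the efficient influence function of μ1 is the familiar augmented inverse probability weighting (AIPW) function; the simulated data and logistic working models are illustrative assumptions. The estimators proposed in this paper replace this function with the misclassification-corrected influence functions (4) and (5), which additionally involve the nuisance components estimated from the combined sample.

```r
set.seed(1)
n <- 5000
dat <- data.frame(x = runif(n))
dat$trt <- rbinom(n, 1, plogis(-0.5 + dat$x))
dat$y   <- rbinom(n, 1, plogis(-1 + dat$x + 0.8 * dat$trt))

e_hat  <- fitted(glm(trt ~ x, family = binomial, data = dat))          # e(X) = P(T = 1 | X)
m1_hat <- predict(glm(y ~ x, family = binomial, data = dat, subset = trt == 1),
                  newdata = dat, type = "response")                    # m1(X) = E(Y | X, T = 1)

# Empirical-mean equation P_n{phi(O; mu1)} = 0 with the accurate-data influence function
#   phi(O; mu1) = T{Y - m1(X)}/e(X) + m1(X) - mu1
phi <- function(mu1) dat$trt * (dat$y - m1_hat) / e_hat + m1_hat - mu1
mu1_hat <- uniroot(function(mu1) mean(phi(mu1)), interval = c(-1, 2))$root

# The equation is linear in mu1, so the same solution also has a closed form
mu1_closed <- mean(dat$trt * (dat$y - m1_hat) / e_hat + m1_hat)
c(estimating_equation = mu1_hat, closed_form = mu1_closed)
```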
Additionally, because our method is directly proposed for the potential-outcome means corresponding to different treatment levels, we can easily derive estimators for a series of causal measures listed in
Section 2. The estimators we propose are derived from the estimating equations constructed with the efficient influence functions. When all the involved models are correctly specified, it is straightforward to show that the proposed estimators are efficient. Furthermore, we demonstrate that the proposed estimators are multiply robust when at least one of the model assumptions below is correctly specified. We first list the three model assumptions:
: The models , , and are correctly specified, meaning that there exist , , and such that , , and equal to their corresponding true models, respectively.
: The models , , and are correctly specified, meaning that there exist , , and such that , , and equal to their corresponding true models, respectively.
: The models , , and are correctly specified, meaning that there exist , , and such that , , and equal to their corresponding true models, respectively.
The following theorem formally demonstrates the properties of the proposed estimators.
Theorem 3. Suppose the standard regularity conditions [37] (pp. 2121–2123) and Assumptions 1–5 hold. Then, the proposed estimator of μt is consistent and asymptotically normal under the union of the three model assumptions. Moreover, it attains the semiparametric efficiency bound when all the involved working models are correctly specified. Theorem 3 ensures the multiple robustness of our proposed estimation method; that is, our method guarantees the consistency of the resulting estimators as long as at least one of the listed assumptions holds. The model sets under the different assumptions reflect various combinations of the components of the data law, and when all the involved components are correctly specified, our proposed estimators for μ1 and μ0 attain their corresponding semiparametric efficiency bounds, which represent the minimum possible asymptotic variances among all regular semiparametric estimators.
Next, we discuss in detail the estimation strategies for the nuisance parameters. Of note, the model sets in the three assumptions overlap: two of the working models each appear in two of the assumption sets. This implies that the multiply-robust estimator of μt requires a consistent estimator of each of these two shared models under the union of the two assumptions that contain it. To achieve this, we consider extending the doubly robust g-estimation [38] to our measurement error setting with the combined sample. First, three of the adopted parametric working models can be fitted directly by common parametric regression methods to obtain estimators of their parameters. Second, we propose estimators for the two shared models by solving the estimating Equations (6) and (7), respectively, where the user-specified index functions are of the same dimensions as the corresponding parameter vectors. We have the following theorem for these two estimators.
Theorem 4. Suppose the standard regularity conditions [37] (pp. 2121–2123) and Assumptions 1–5 hold. When estimating μt, each of the two estimators proposed above is consistent and asymptotically normal under either of the two model assumptions that involve the corresponding nuisance model. Finally, by using the plug-in approach, we utilize the estimates of the nuisance parameters to obtain the multiply-robust estimator of μt. From the estimation process, it can be seen that the entire estimation procedure can be viewed as solving a complex system of estimating equations, consisting of the estimating equations for the parameters of interest and for the nuisance parameters. Thus, the asymptotic variances of the proposed estimators can be derived using standard M-estimation theory [37,39]. To facilitate computation, the bootstrap method is commonly employed in practice for variance estimation and the construction of confidence intervals.
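The following R sketch illustrates the bootstrap procedure mentioned above; `estimate_mu` is a hypothetical placeholder standing in for the full multiply-robust fitting procedure applied to one data set, and resampling the combined sample as a whole is only one of several reasonable resampling schemes.

```r
# dat: the combined sample, one row per individual (primary and validation units together)
# estimate_mu(dat): hypothetical placeholder returning, e.g., the multiply-robust estimate of mu_1
bootstrap_ci <- function(dat, estimate_mu, B = 500, level = 0.95) {
  est  <- estimate_mu(dat)
  boot <- replicate(B, {
    idx <- sample.int(nrow(dat), replace = TRUE)   # resample individuals with replacement
    estimate_mu(dat[idx, , drop = FALSE])          # re-run the entire estimation procedure
  })
  alpha <- 1 - level
  list(estimate = est,
       se = sd(boot),                                             # bootstrap standard error
       ci = unname(quantile(boot, c(alpha / 2, 1 - alpha / 2))))  # percentile confidence interval
}
```

A stratified variant that resamples the primary and validation samples separately, preserving their relative sizes, is a natural alternative.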
4. Simulation Studies
In this section, we conducted simulation studies to (1) verify the multiple robustness of our proposed method and (2) evaluate the finite-sample performance of the resulting estimators in the measurement error setting. Our simulation comprised two examples. In Example 1, we considered the two potential-outcome means, the risk difference, and the risk ratio as the quantities of interest to be estimated. In Example 2, our data-generating mechanism ensured the validity of the monotonicity assumption, and we considered the joint probabilities of the potential outcomes as the estimands of interest. In both examples, bias, root-mean-square error (RMSE), and standard error (SE) were used as the evaluation criteria, and all results were based on 1000 repeated simulation experiments with sample sizes of 2000 and 5000 under four cases:
- (1) All the models are correctly specified;
- (2) Only the first model assumption holds;
- (3) Only the second model assumption holds;
- (4) Only the third model assumption holds.
In Example 1, similar to Wang and Tchetgen Tchetgen [35], the baseline covariates X included an intercept and a variable uniformly distributed on an interval, and we considered a data-generating mechanism under which we simulated the treatment T, the true outcome Y, and the misclassified outcome W. The generated data represented realizations of (X, T, Y, W). We then generated the sample indicator R. A primary sample was created with R = 1, recording only the realizations of (X, T, W), and a validation sample was generated with R = 0, recording only the realizations of (X, Y, W). The two samples were combined to form the simulated data set. The true values of μ1 and μ0 under this data-generating process were obtained with the Monte Carlo approach. To implement the proposed method, we chose the identity functions of the covariates as the user-specified index functions. We could easily derive the true parametric models for the five functions involved in constructing the multiply-robust estimator of μt. Three of the working models were fitted by the maximum likelihood method. For the remaining two models, parameter estimation was conducted using (6) and (7) by the method of estimating equations, implemented through the optim function in R, with the quasi-Newton method specified. Finally, we obtained the resulting estimators using the plug-in method. When considering model misspecification, we used a transformed version of the covariate, instead of the original covariate, to fit the involved models.
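Since the text indicates that Equations (6) and (7) were solved with the optim function and a quasi-Newton method, the sketch below shows one generic way to do this in R: minimize the squared norm of the stacked estimating function. The estimating function U used here (a logistic-regression score on simulated data) is only a hypothetical stand-in for the actual functions in (6) and (7).

```r
set.seed(2)
n <- 2000
x <- runif(n, -1, 1)
y <- rbinom(n, 1, plogis(0.3 + 0.9 * x))
h <- cbind(1, x)   # user-specified index functions (here, the identity functions of the covariates)

# Stacked estimating function U(theta) = (1/n) * sum_i h(X_i) * {Y_i - expit(h(X_i)' theta)}
U <- function(theta) {
  resid <- as.vector(y - plogis(h %*% theta))
  colMeans(h * resid)
}

# Solve U(theta) = 0 by minimizing its squared norm with a quasi-Newton method
fit <- optim(par = c(0, 0), fn = function(theta) sum(U(theta)^2), method = "BFGS")
fit$par   # close to the data-generating values (0.3, 0.9) in large samples
```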
Table 1 reports the simulation results for all four cases in Example 1. The multiply-robust method exhibited small bias, RMSE, and SE for the four estimands across all cases. The SE results were close to the RMSE results, and both decreased as the sample size increased. Compared with the other three estimands, the estimated risk ratio tended to have a larger RMSE and SE, because it involves the ratio of two estimates and a small denominator can substantially inflate the variance. In addition, it is worth noting that with a sample size of 5000 and all models correctly specified, the estimators exhibited the smallest SE. These simulation results support the preceding theoretical results. For clarity, the results of naive estimates that ignore measurement error are not included in the table; these estimates were severely biased. For example, with a sample size of 5000 and a correctly specified propensity score, the naive inverse probability weighting estimate based on (1) remained substantially biased, and the bias did not noticeably decrease with a larger sample size.
In Example 2, the covariates X included a variable generated from the standard normal distribution. Let I(·) denote the indicator function that takes the value one when its argument is greater than zero and zero otherwise. The treatment T, the true outcome Y, and the misclassified outcome W were then generated accordingly. Under this data-generating mechanism, the potential outcome under the treatment level T = 1 was no less than the potential outcome under the treatment level T = 0. On this condition, the estimation of the joint probabilities of the potential outcomes could be achieved through the estimation of the potential-outcome means. We divided the generated data into two parts and combined them into one sample using the same method as in Example 1. The true values of μ1 and μ0 in Example 2 were obtained using the Monte Carlo approach. The choices of the index functions and the parameter estimation for the parametric models were similar to those in Example 1. We calculated the three joint probabilities, NBR, ABR, and TBR, under the four cases. When considering model misspecification, we used a nonlinear transformation of the original variable for fitting.
Table 2 reports the simulation results for Example 2. From the results, it can be observed that for the three probabilities, our proposed multiply-robust method exhibited relatively small bias, RMSE, and SE. When the sample size increased from 2000 to 5000, the RMSE and SE for the three probabilities decreased. Additionally, although the proposed method exhibited the smallest SE when all models were correctly specified, in the other three cases, where two of the involved models were misspecified, the SE of the estimators did not increase significantly. This is similar to the simulation results in Example 1.
5. Data Analysis
In this section, we aim to analyze a publicly available data set from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is a survey carried out across all 50 states in the United States by the Centers for Disease Control and Prevention (CDC), and it gathers data on health-related behaviors (such as smoking habits), healthcare access, and chronic conditions. Detailed information on the survey design, sampling methods, data collection, and statistical weighting is available on the CDC website (
http://www.cdc.gov/brfss, accessed on 2 September 2024). The data were also previously analyzed by many other researchers [40,41,42].
The main purpose of our analysis was to explore the impact of smoking on obesity (defined as having a body mass index (BMI) of at least 30 kg/m²). We aimed to determine whether smoking had a significant effect on obesity rates and, if so, whether this effect was positive or negative. This relationship can provide valuable insights into the potential health implications of smoking and inform public health strategies [43,44,45]. Current smokers were determined using the SMOKER2 (Computed Smoking Status) variable. In our study, individuals who responded as “Current Smoker—Now Smokes Every Day” or “Current Smoker—Now Smokes Some Days” were classified as smokers (SMOKER2 = 1 or 2), as done in the work by Sharbaugh [46]. The weight and height data contained in the BRFSS data set are self-reported and prone to measurement errors, making the obesity indicator derived from the BMI data susceptible to inaccuracies. In other words, besides the treatment variable indicating whether an individual is a smoker, the BRFSS data set includes information on a misclassified version of the outcome variable, but not the true outcome variable. The National Health and Nutrition Examination Survey (NHANES) is another survey designed to assess the health and nutritional status of adults and children in the United States. The NHANES data set contains not only self-reported weights and heights but also measured weights and heights; however, it does not include data on smoking habits, the treatment variable of interest. We therefore used the NHANES data as the validation data set. The obesity rate calculated from the BMI information in this validation data set is considered accurate. The NHANES data set can be accessed at https://www.cdc.gov/nchs/nhanes (accessed on 2 September 2024). According to the data set, approximately ten percent of the data are reported incorrectly.
Alongside the binary treatment variable for smoking and the binary outcome variable for obesity, we incorporated age, gender, race, and education as the control variables, as these factors can influence both the treatment and the outcome. For both data sets, our analysis sample was limited to the 2018 survey year. We preprocessed the data using some standard procedures, such as removing cases with missing information, converting the calculated BMI data into a binary format indicating obesity, and encoding nominal variables.
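As an illustration of these preprocessing steps, the short R sketch below uses a toy data frame; apart from SMOKER2, which is mentioned in the text, the column names and codings are hypothetical placeholders, and only a subset of the control variables is shown.

```r
# Toy stand-in for the raw survey data (column names other than SMOKER2 are hypothetical)
raw <- data.frame(
  SMOKER2 = c(1, 2, 3, 4, NA),
  bmi     = c(31.2, 24.5, 28.9, NA, 33.0),
  age     = c(45, 60, 37, 52, 41),
  sex     = c("M", "F", "F", "M", "F"),
  educ    = c(3, 4, 2, 4, 1)
)

dat <- na.omit(raw)                                 # remove cases with missing information
dat$smoker <- as.integer(dat$SMOKER2 %in% c(1, 2))  # treatment: current smoker
dat$obese  <- as.integer(dat$bmi >= 30)             # binary (misclassified) obesity indicator from BMI
dat$sex    <- factor(dat$sex)                       # encode nominal variables as factors
dat$educ   <- factor(dat$educ)
head(dat)
```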
For the purpose of model fitting, logistic regression models were employed for four of the working models, and a hyperbolic tangent function was utilized for modeling the remaining one, where X represents the control covariates previously discussed. We present the analysis results in Table 3, reporting point estimates, standard errors, and 95% confidence intervals for method evaluation. The standard errors and confidence intervals were derived using the bootstrap method with 500 bootstrap samples.
Table 3 presents the analysis results for the same four estimands as those in Example 1 of the simulation studies. The point estimates of the potential-outcome means corresponding to the two treatment levels both fell within the interval (0, 1). The point estimates of the risk difference and the risk ratio indicated that smoking reduces obesity rates, which is consistent with some previous studies [43,47]. This is mainly because nicotine is thought to suppress appetite and increase metabolism, potentially leading to lower body weight among smokers; therefore, smokers have a lower obesity rate than non-smokers. However, as noted by Munafò [48], although smoking may be associated with a lower body weight, it poses significant health risks, including heart disease, lung disease, and cancer. From a long-term health perspective, smoking is not a healthy way to control weight and can lead to more severe health issues. In addition, for the four estimands, the standard errors calculated by the bootstrap method were small, and the estimated 95% confidence intervals did not include zero. The results suggest that the proposed estimation method may perform satisfactorily in real-world scenarios. Additionally, when the monotonicity assumption held (in our example, this meant that smoking did not cause weight gain for any individual), we could also calculate the NBR, ABR, and TBR.