1. Introduction
In the realm of causal inference, randomized experiments stand as the gold standard for estimating treatment effects (Imbens and Rubin 2015; Rubin 1974). By randomly assigning individuals to treatment and control groups, researchers can contrast outcomes to study the impact of the intervention. However, the practicality of conducting large-scale randomized experiments often falls short, either due to logistical constraints or ethical considerations. Consequently, researchers often turn to observational data to explore causal relationships, despite its inherent limitations. The cornerstone of using observational data lies in the unconfoundedness assumption (Rosenbaum and Rubin 1983), which states that all confounding factors in the study are adequately controlled for. However, in reality, this assumption is often untestable and frequently violated, casting doubt on the validity of conclusions drawn from observational studies.
Due to the complementary strengths of experimental and observational data, researchers have proposed methods that combine the two to estimate treatment effects. For example, Kallus et al. (2018) estimate, with a linear model, the bias incurred when using observational data to estimate the average treatment effect, and Rosenman et al. (2023) take advantage of the classic James–Stein shrinkage estimator (James and Stein 1992) to combine the estimates from observational data and experimental data while assuming unconfoundedness. In this paper, we address this challenge within a specific context, where the observational data include both the outcome of interest and a surrogate outcome, while the experimental data only provide the surrogate outcome. Our objective is to propose an easily implementable estimator that leverages both sources of data to estimate the average treatment effect of interest. By bridging the gap between observational and experimental data, our approach offers a robust and reliable method for treatment effect estimation.
The remainder of the paper is organized as follows. Section 2 introduces the basic setup. In Section 3, we develop our method to estimate the treatment effect on the primary outcome by using information from the experimental study. In Section 4, we discuss several widely studied extensions to the basic setup and give a concrete solution to each extension. Section 5 compares several different methods through simulations. Section 6 illustrates the procedure on a real dataset, and Section 7 concludes.
2. Setup
Suppose we want to estimate the treatment effect of an intervention on a primary outcome $Y$. We consider leveraging the data from two distinct studies: an observational study and an experimental study. For each unit $i$ in the observational study, we collect data on the treatment assignment $W_i$, the primary outcome $Y_i$, a surrogate outcome $S_i$, and a set of pre-treatment covariates $X_i$. The surrogate outcome is any variable that changes post-treatment, and while we primarily discuss the case where $S_i$ is one-dimensional, our methodology is readily extendable to multi-dimensional surrogate outcomes.
Under the assumption of unconfoundedness, i.e., $\{Y_i(1), Y_i(0)\} \perp W_i \mid X_i$, either the Inverse Probability Weighting (IPW) estimator or the Augmented Inverse Probability Weighting (AIPW) estimator would suffice for estimating the treatment effect. However, there are numerous scenarios in which the assumption of unconfoundedness is not plausible. In such cases, estimating the treatment effect using only the observational data becomes infeasible.
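For reference, when unconfoundedness does hold, the AIPW estimate can be computed as in the minimal sketch below; the notation ($W$ treatment, $Y$ outcome, $X$ covariates) follows the setup above, and the learner choices (a logistic propensity model, random-forest outcome regressions) are illustrative assumptions rather than choices prescribed by this paper.

```python
# Minimal AIPW sketch under unconfoundedness (illustrative learner choices).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def aipw_ate(Y, W, X):
    """Augmented inverse probability weighting estimate of E[Y(1) - Y(0)]."""
    # Propensity score e(x) = P(W = 1 | X = x), clipped to avoid extreme weights.
    e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    e_hat = np.clip(e_hat, 0.01, 0.99)

    # Outcome regressions mu_w(x) = E[Y | W = w, X = x].
    mu1 = RandomForestRegressor(n_estimators=200).fit(X[W == 1], Y[W == 1]).predict(X)
    mu0 = RandomForestRegressor(n_estimators=200).fit(X[W == 0], Y[W == 0]).predict(X)

    # Doubly robust score for each unit, averaged over the sample.
    psi = mu1 - mu0 + W * (Y - mu1) / e_hat - (1 - W) * (Y - mu0) / (1 - e_hat)
    return psi.mean()
```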
To address this challenge, we introduce a secondary source of data: a prior study focusing on the surrogate outcome $S$, in which the assumption of unconfoundedness is satisfied. Typically, this prior study takes the form of a small-scale randomized experiment concerning the surrogate outcome. Therefore, we operate with two samples: an observational sample and an experimental sample. Each unit $i$ in the observational sample provides a tuple $(X_i, W_i, S_i, Y_i)$, while each unit $i$ in the experimental sample provides a tuple $(X_i, W_i, S_i)$. It is important to note that the size $n$ of the experimental sample is significantly smaller than the size $m$ of the observational sample.
Our primary objective is to estimate the quantity
$$\tau = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid G_i = \mathrm{O}\right],$$
where $G_i \in \{\mathrm{E}, \mathrm{O}\}$ is an indicator denoting the sample to which unit $i$ belongs. This setup is consistent with the framework presented by Athey et al. (2020).
By integrating data from both the observational and experimental studies, we aim to leverage the strengths of each approach. The observational study provides a large sample size and detailed covariate information, while the experimental study offers reliable causal inference for the surrogate outcome under the unconfoundedness assumption. This combined approach allows us to robustly estimate the treatment effect on the primary outcome $Y$, even in the presence of potential confounding factors in the observational study.
3. Method
In this section, we develop our method to estimate the average treatment effect (ATE) of the treatment on the primary outcome $Y$. To achieve point identification of the ATE, we assume the following structural model for $Y$:
$$Y_i = \mu(S_i, X_i) + \varepsilon_i,$$
where $\varepsilon_i$ is independent of $S_i$ and $X_i$. Note that this structural model is general in the sense that the primary outcome can depend on the pre-treatment covariates in an arbitrary way. We assume the errors to be exogenous; our estimating procedure can easily be extended to endogenous-error settings such as the instrumental variable setup. This model implies that all of the effect of the treatment on the primary outcome is mediated through the surrogate outcome. Consequently, the surrogate outcome, together with the pre-treatment covariates, determines the primary outcome. Under this assumption, $\tau$ is identifiable.
To see this, we define
$$\mu(s, x) = \mathbb{E}[Y_i \mid S_i = s, X_i = x]$$
and note that, under the structural model, $Y_i(w) = \mu(S_i(w), X_i) + \varepsilon_i$ for $w \in \{0, 1\}$. Then, $\tau$ can be expressed as $\mathbb{E}[\mu(S_i(1), X_i) - \mu(S_i(0), X_i)]$. The joint distribution of $S_i(w)$ and $X_i$ is identifiable from the experimental sample due to unconfoundedness, while $\mu$ is identifiable from the observational sample because the errors are exogenous.
Consider a concrete model where $Y_i = \theta S_i + f(X_i) + \varepsilon_i$, with $\varepsilon_i$ being independent of $S_i$ and $X_i$. For such a model, we can use Robinson's residual-on-residual method to estimate $\theta$, ensuring the final estimate of the ATE is consistent. For the general case, we can estimate $\tau$ using the following procedure:
1. Regress $Y$ on $S$ and $X$ in the observational sample to obtain an estimate of $\mu$, denoted as $\hat{\mu}$.
2. Estimate the conditional average treatment effect on the surrogate outcome, $\tau_S(x) = \mathbb{E}[S_i(1) - S_i(0) \mid X_i = x]$, from the experimental sample, obtaining an estimate $\hat{\tau}_S$.
3. Define $\tilde{S}_i = S_i + \hat{\tau}_S(X_i)$ if $W_i = 0$ and $\tilde{S}_i = S_i$ if $W_i = 1$.
4. Estimate $\mathbb{E}[Y_i(1) \mid G_i = \mathrm{O}]$ by $m^{-1} \sum_{i \in \mathrm{O}} \hat{\mu}(\tilde{S}_i, X_i)$.
5. Define $\check{S}_i = S_i - \hat{\tau}_S(X_i)$ if $W_i = 1$ and $\check{S}_i = S_i$ if $W_i = 0$.
6. Estimate $\mathbb{E}[Y_i(0) \mid G_i = \mathrm{O}]$ by $m^{-1} \sum_{i \in \mathrm{O}} \hat{\mu}(\check{S}_i, X_i)$.
7. The final estimate is $\hat{\tau} = m^{-1} \sum_{i \in \mathrm{O}} \{\hat{\mu}(\tilde{S}_i, X_i) - \hat{\mu}(\check{S}_i, X_i)\}$.
With this procedure, we can estimate the ATE on the primary outcome using a single model for the conditional response function and one model for the conditional average treatment effect (CATE) estimation. In the following sections, we will discuss various adaptations of this procedure for different scenarios.
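To make the procedure concrete, the sketch below implements the seven steps with generic plug-in learners. The notation follows the reconstruction above, and the particular learners (random forests for $\mu$, a simple T-learner for the surrogate CATE on the unconfounded experiment) are illustrative assumptions; step 2 is the piece that is swapped out in the settings of Section 4.

```python
# A sketch of the Section 3 procedure with generic plug-in learners (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def surrogate_ate(obs, exp):
    """obs: dict with arrays Y, S, W, X (observational); exp: dict with S, W, X (experimental)."""
    # Step 1: regress Y on (S, X) in the observational sample to estimate mu(s, x).
    mu_hat = RandomForestRegressor(n_estimators=500).fit(
        np.column_stack([obs["S"], obs["X"]]), obs["Y"])

    # Step 2: estimate the surrogate CATE tau_S(x) on the experimental sample
    # (a simple T-learner, valid because the experiment is unconfounded).
    m1 = RandomForestRegressor(n_estimators=500).fit(
        exp["X"][exp["W"] == 1], exp["S"][exp["W"] == 1])
    m0 = RandomForestRegressor(n_estimators=500).fit(
        exp["X"][exp["W"] == 0], exp["S"][exp["W"] == 0])
    tau_S = m1.predict(obs["X"]) - m0.predict(obs["X"])

    # Steps 3-4: impute the treated surrogate for control units, then average mu_hat.
    S_treated = np.where(obs["W"] == 1, obs["S"], obs["S"] + tau_S)
    y1_hat = mu_hat.predict(np.column_stack([S_treated, obs["X"]])).mean()

    # Steps 5-6: impute the control surrogate for treated units, then average mu_hat.
    S_control = np.where(obs["W"] == 0, obs["S"], obs["S"] - tau_S)
    y0_hat = mu_hat.predict(np.column_stack([S_control, obs["X"]])).mean()

    # Step 7: the final ATE estimate is the difference of the two averages.
    return y1_hat - y0_hat
```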
4. Applications
In the previous section, we developed a general procedure to combine the experimental sample and the observational sample. It relies on first estimating the conditional average treatment effect on the surrogate outcome and then correcting the surrogate outcomes in the observational sample. Estimating the conditional average treatment effect (CATE) is usually a case-by-case problem and involves different estimation methods for different settings. In this section, we discuss four settings where we can apply the estimator in Section 3 with different versions of step 2. We also discuss the setting where we drop the unconfoundedness assumption on the experimental sample. In fact, as long as the conditional average treatment effect $\tau_S(x)$ is identifiable, unconfoundedness is not necessary.
4.1. Covariate Support Mismatch between Samples
The first scenario we consider aligns with the setting discussed in Kallus et al. (2018), where the support of the pre-treatment covariates in the experimental sample differs from that in the observational sample. We tackle this scenario by adding a calibration step on top of the estimating procedure. Such a situation often arises in practice because the experimental sample typically derives from historical data, making it unlikely that the experimental and observational studies target identical populations. Under these circumstances, using only the experimental sample to estimate the conditional average treatment effect (CATE) requires extrapolation to the observational sample. Such extrapolation becomes particularly problematic when the experimental sample size is much smaller than that of the observational sample. Therefore, it is essential to calibrate our CATE estimates on the experimental sample to avoid potential biases.
Kallus et al. (2018) observed that if we define the pseudo-outcome
$$q_i = \left(\frac{W_i}{e(X_i)} - \frac{1 - W_i}{1 - e(X_i)}\right) S_i$$
for units in the experimental sample, where $e(x)$ is the propensity score in the experiment, and let $\omega(x)$ denote the (possibly confounded) CATE function obtained from the observational sample, then
$$\mathbb{E}[q_i \mid X_i = x] = \tau_S(x)$$
on the support of the experimental covariates. We now define the confounding bias $\eta(x)$ as
$$\eta(x) = \tau_S(x) - \omega(x);$$
then the above observation motivates the following procedure to estimate the CATE of the surrogate outcome on the observational sample:
1. Estimate $\omega$ from the observational sample with any CATE estimation algorithm, obtaining $\hat{\omega}$.
2. Approximate $\eta(x)$ by a linear function $\theta^{\top} x$ and estimate $\theta$ on the experimental sample by minimizing $\sum_{i \in \mathrm{E}} \left(q_i - \hat{\omega}(X_i) - \theta^{\top} X_i\right)^2$.
3. Set $\hat{\tau}_S(x) = \hat{\omega}(x) + \hat{\theta}^{\top} x$ for units in the observational sample.
Using the above estimate $\hat{\tau}_S$, we can proceed with the estimator described in Section 3. We can view the above steps as performing an additional calibration. The core idea is to leverage a loss function to estimate the difference between the ill-posed target $\omega$ and the true quantity of interest $\tau_S$. A more general approach can be achieved by fitting a non-parametric function of $x$ instead of a linear function.
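As an illustration of this calibration, the sketch below follows the spirit of Kallus et al. (2018): an initial CATE estimate fit on the (possibly confounded) observational sample is corrected by a linear term fit on the experimental sample using the IPW pseudo-outcome. The pseudo-outcome, the T-learner used for the initial estimate, and the function names are our illustrative choices.

```python
# Calibration of the surrogate CATE in the spirit of Kallus et al. (2018); illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def calibrated_surrogate_cate(obs, exp, e_exp):
    """obs/exp: dicts with arrays S, W, X; e_exp: propensity scores in the experiment."""
    # Initial (possibly biased) CATE estimate omega(x) from the observational sample.
    m1 = RandomForestRegressor(n_estimators=500).fit(
        obs["X"][obs["W"] == 1], obs["S"][obs["W"] == 1])
    m0 = RandomForestRegressor(n_estimators=500).fit(
        obs["X"][obs["W"] == 0], obs["S"][obs["W"] == 0])
    omega = lambda X: m1.predict(X) - m0.predict(X)

    # Unbiased IPW pseudo-outcome for the surrogate CATE on the experimental sample.
    q = (exp["W"] / e_exp - (1 - exp["W"]) / (1 - e_exp)) * exp["S"]

    # Fit the linear correction eta(x) = theta' x by least squares on the experiment.
    correction = LinearRegression().fit(exp["X"], q - omega(exp["X"]))

    # Calibrated CATE, usable on the covariate support of the observational sample.
    return lambda X: omega(X) + correction.predict(X)
```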
In summary, this procedure helps to mitigate the issues arising from covariate support mismatch between the experimental and observational samples. By calibrating the CATE estimates from the experimental sample with information from the observational sample, we improve the robustness and reliability of our treatment effect estimates.
4.2. Instrumental Variable (IV) Setting in the Experimental Sample
In this section, we drop our unconfoundedness assumption on the experimental sample and consider the instrumental variable setting, which is widely studied in the econometrics literature. Note that without the unconfoundedness assumption, point identification is limited to a few specific settings, such as instrumental variables. Future work could include extending our estimating procedure to the setting where only observational data are available and incorporating our approach into the existing literature on estimating conditional average treatment effects with observational data (Wang et al. 2022; Wu and Yang 2022; Xie et al. 2012).
4.2.1. Constant Effect
We start with the simplest instrumental variable setting, where the effect is constant. In particular, we consider a setting where, in the experimental sample, we have an instrumental variable $Z$ with the following structural model:
$$S_i = \beta W_i + \gamma^{\top} X_i + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid Z_i, X_i] = 0.$$
Such a model is introduced in almost every econometrics textbook, for example, in Angrist and Pischke (2009). It can be seen easily that the parameter $\beta$ is exactly the conditional average treatment effect of $W$ on the surrogate $S$. It is well known that we can then estimate $\beta$ by two-stage least squares (2SLS) in the instrumental variable literature.
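As a minimal illustration, 2SLS can be implemented with two ordinary least squares projections; the sketch below includes the covariates as exogenous regressors, matching the model as written above, and the implementation details are ours.

```python
# Minimal 2SLS sketch with numpy for the constant-effect model (illustrative).
import numpy as np

def tsls_constant_effect(S, W, Z, X):
    """Return the 2SLS estimate of the constant effect beta of W on S."""
    n = len(S)
    ones = np.ones((n, 1))
    # First stage: project the treatment on the instrument and the covariates.
    first_stage = np.column_stack([ones, Z.reshape(n, -1), X])
    W_hat = first_stage @ np.linalg.lstsq(first_stage, W, rcond=None)[0]
    # Second stage: regress the surrogate on the fitted treatment and the covariates.
    second_stage = np.column_stack([ones, W_hat.reshape(n, 1), X])
    coef = np.linalg.lstsq(second_stage, S, rcond=None)[0]
    return coef[1]  # coefficient on the instrumented treatment
```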
4.2.2. Non-Parametric IV
Now, we consider a more general instrumental variable setting. Specifically, we consider the following model:
$$S_i = \tau_S(X_i) W_i + g(X_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid Z_i, X_i] = 0,$$
where the effect $\tau_S(\cdot)$ is a function of the covariates rather than a constant. This is in fact a special case of the more general non-parametric instrumental variable model (Hall and Horowitz 2005; Horowitz 2011; Newey and Powell 2003). To estimate $\tau_S$, we can follow Hall and Horowitz (2005). Conditioning on $W_i = 1$ and using the exogeneity of the instrument, we obtain an integral equation in which the unknown function is $\tau_S + g$; integrating both sides with respect to the instrument, as in Hall and Horowitz (2005), turns it into an operator equation that can be solved with the Hall–Horowitz estimator. Similarly, conditioning on $W_i = 0$ yields a second operator equation in which only $g$ is involved, and with that equation we are able to estimate $g$. Then, we can estimate $\tau_S$ by taking the difference.
Hall and Horowitz (2005) establish good theoretical properties for this method. However, it involves estimating density functions, which is unstable in practice. Moreover, Hall and Horowitz (2005) aim to address the general non-parametric IV problem, while we only care about $\tau_S$.
While our structural model assumption represents a specific case within the broader framework of non-parametric instrumental variable (IV) models, we can leverage alternative methods for more general applications. Specifically, the generalized random forest (GRF) methodology, proposed by Athey et al. (2019), offers a flexible and computationally efficient approach for estimating the conditional average treatment effect (CATE), especially under our structural model assumption.
The GRF framework extends the traditional random forest algorithm to accommodate the estimation of heterogeneous treatment effects and, more broadly, of any quantity of interest identified as the solution to a set of local moment equations (Athey et al. 2019).
We recommend the use of GRF for estimating $\tau_S$ for the following two reasons (a simplified stand-in is sketched after the list):
Flexibility: GRF is capable of modeling complex, non-linear relationships between the covariates and the treatment effect, which is often essential in practical applications where such relationships are not adequately captured by parametric models.
Generalizability: One notable advantage of GRF is its ability to generalize beyond binary treatment variables. As discussed in Athey et al. (2019), GRF can be extended to settings where the treatment variable $W$ is a real-valued continuous variable.
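GRF itself is available in the R package grf, and Python implementations of forest-based CATE estimators exist (e.g., in the econml package). As a lighter-weight stand-in that targets the same moment condition $\mathbb{E}[(S - \tau_S(X) W - g(X)) Z \mid X] = 0$ under a linear working model for $\tau_S$, the sketch below residualizes $S$, $W$, and $Z$ on $X$ with random forests and then runs 2SLS with covariate-interacted instruments. This is not GRF; it is only an illustration of the kind of CATE estimate that can be plugged into step 2.

```python
# Double-ML-style IV sketch with a linear CATE model (a stand-in for GRF; illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def linear_iv_cate(S, W, Z, X):
    n = len(S)
    # Residualize S, W, and Z on X to remove the nuisance dependence on covariates.
    rS = S - RandomForestRegressor(n_estimators=300).fit(X, S).predict(X)
    rW = W - RandomForestRegressor(n_estimators=300).fit(X, W).predict(X)
    rZ = Z - RandomForestRegressor(n_estimators=300).fit(X, Z).predict(X)

    basis = np.column_stack([np.ones(n), X])     # working model tau_S(x) = beta' [1, x]
    D = basis * rW.reshape(-1, 1)                # endogenous regressors
    instruments = basis * rZ.reshape(-1, 1)      # covariate-interacted instruments

    # Two-stage least squares: project D on the instruments, then regress rS on the projection.
    D_hat = instruments @ np.linalg.lstsq(instruments, D, rcond=None)[0]
    beta = np.linalg.lstsq(D_hat, rS, rcond=None)[0]
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta
```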
4.3. Instrumental Variable Setting with Different Support of Pre-Treatment Covariates
In this section, we address the scenario where we have different support of pre-treatment covariates and a non-parametric instrumental variable (IV) model for the experimental sample. This approach is particularly relevant when considering complex experimental designs with multi-dimensional covariates.
To formalize our setup, we define the following conditional expectations:
$$h(z, x) = \mathbb{E}[S_i \mid Z_i = z, X_i = x], \qquad w(z, x) = \mathbb{E}[W_i \mid Z_i = z, X_i = x].$$
Given these definitions, it follows from the model in Section 4.2.2 that
$$h(z, x) = \tau_S(x)\, w(z, x) + g(x).$$
Thus, we can write the parameter of interest $\tau_S$ as the solution to the following minimization problem:
$$(\tau_S, g) = \mathop{\mathrm{arg\,min}}_{t,\, g'} \ \mathbb{E}\Big[\big(h(Z_i, X_i) - t(X_i)\, w(Z_i, X_i) - g'(X_i)\big)^2\Big].$$
The direct estimation of $\tau_S$ using the above loss function is possible; however, it proves to be inefficient in practice, particularly when dealing with multi-dimensional pre-treatment covariates. This inefficiency arises from the need to estimate numerous nuisance parameters, leading to error accumulation and reduced robustness.
Inspired by the loss-defining property of $\tau_S$, we propose an alternative estimation procedure, which we term the Kallus IV method, adapted from Kallus et al. (2018). The procedure is as follows:
1. Apply any conditional average treatment effect (CATE) estimation algorithm, denoted by $\mathcal{A}$, to the observational data $\{(X_i, W_i, S_i)\}_{i \in \mathrm{O}}$ to obtain an initial estimate $\hat{\omega}$.
2. Solve the following optimization problem on the experimental sample to refine the estimate:
$$\hat{\theta} = \mathop{\mathrm{arg\,min}}_{\theta,\, g'} \ \sum_{i \in \mathrm{E}} \Big(\hat{h}(Z_i, X_i) - \big(\hat{\omega}(X_i) + \theta^{\top} X_i\big)\, \hat{w}(Z_i, X_i) - g'(X_i)\Big)^2,$$
where $\hat{h}$ and $\hat{w}$ are plug-in estimates of the conditional expectations defined above.
3. Use the combined estimate $\hat{\tau}_S(x) = \hat{\omega}(x) + \hat{\theta}^{\top} x$ as the final estimate of the CATE on the surrogate.
This procedure leverages the initial non-parametric estimate and refines it using an optimization framework that adjusts for the instrumental variables' influence. While the optimization step is currently formulated linearly in $x$, it is worth noting that a non-parametric function of $x$ could be fitted instead. However, empirical results indicate that non-parametric adjustments may lead to unstable estimates when the dimensionality of covariates is high, emphasizing the trade-off between flexibility and stability.
The Kallus IV method thus offers a robust approach to estimate the CATE in the presence of multi-dimensional covariates and instrumental variables.
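To illustrate, the sketch below implements one version of the refinement under additional simplifying assumptions: plug-in regression estimates of $h$ and $w$, and a linear working model for the nuisance $g$ so that the optimization reduces to ordinary least squares. The exact objective used in the procedure above may differ; the names and modeling choices here are ours.

```python
# A simplified Kallus-IV-style refinement (illustrative assumptions throughout).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def kallus_iv_cate(exp, omega_hat):
    """exp: dict with arrays S, W, Z, X (experimental sample);
    omega_hat: callable X -> initial CATE estimate, e.g., fit on the observational sample."""
    ZX = np.column_stack([exp["Z"], exp["X"]])
    # Plug-in estimates of h(z, x) = E[S | Z, X] and w(z, x) = E[W | Z, X].
    h_hat = RandomForestRegressor(n_estimators=300).fit(ZX, exp["S"]).predict(ZX)
    w_hat = RandomForestRegressor(n_estimators=300).fit(ZX, exp["W"]).predict(ZX)

    # h_hat ~ (omega(X) + theta' X) * w_hat + g(X); with a linear g this is OLS
    # on the stacked design [X * w_hat, X].
    target = h_hat - omega_hat(exp["X"]) * w_hat
    design = np.column_stack([exp["X"] * w_hat.reshape(-1, 1), exp["X"]])
    fit = LinearRegression().fit(design, target)
    theta = fit.coef_[: exp["X"].shape[1]]

    # Refined CATE, evaluable on new covariate profiles (e.g., the observational sample).
    return lambda X_new: omega_hat(X_new) + X_new @ theta
```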
5. Simulations
In the previous sections, we outlined a procedure to estimate the average treatment effect of the primary outcome using prior information in the experimental sample. We considered three scenarios in which our procedure, as described in Section 3, can be utilized. In this section, we compare several estimators through a series of simulations. Our primary objective is to compare our proposed procedure with the canonical imputation estimator presented by Athey et al. (2020), particularly in cases where we have an unconfounded experimental sample.
We consider two primary settings in our analysis: one where there is no confounding in the experimental sample (i.e., we have either a randomized experiment or an unconfounded experiment) and another where there is confounding (assuming a non-parametric instrumental variable (IV) model for the experimental sample). For each of these settings, we further divide our analysis into two subcases: (1) the support of the pre-treatment covariates in the experimental sample is the same as the support of pre-treatment covariates in the observational sample, and (2) the support of the pre-treatment covariates in the experimental sample is not the same as the support in the observational sample, though they do overlap.
When there is no confounding, we compare three estimators: (1) the imputation estimator as described by Athey et al. (2020), (2) our estimator with $\tau_S$ estimated using a generalized random forest, and (3) our estimator with $\tau_S$ estimated using the approach of Kallus et al. (2018). In the presence of confounding, both the imputation estimator and the approach of Kallus et al. (2018) become invalid, as they require the experimental sample to be unconfounded. Therefore, in these scenarios, we compare two estimators: (1) our estimator with $\tau_S$ estimated by a generalized random forest and (2) our estimator with $\tau_S$ estimated using the Kallus IV approach.
The simulations are designed to provide a robust comparison of these estimators under varying conditions of confounding and covariate support. By doing so, we aim to identify the strengths and limitations of each method, particularly focusing on the performance of our proposed estimator in different scenarios.
We work with the same data-generating mechanism as in the appendix of Athey et al. (2019); the noise terms are independent.
Now, we can adjust several parameters in the data generation mechanism to satisfy different conditions:
Presence of Confounding: we vary the confounding parameter to be either 0 or 1. If it equals 0, there is no confounding; otherwise, there is confounding, and we are in the non-parametric instrumental variable (IV) model.
Sparsity of the Signal: we set the sparsity level of the signal to be either 2 or 4.
Additivity of the Signal: we vary whether or not the signal enters the model additively.
Presence of Nuisance Terms: we vary whether nuisance terms are present; their form depends on the additivity condition above.
Identical Support: when true, the distribution of the covariates in the experimental sample and that in the observational sample are the same; when false, the covariates in the observational sample follow a different distribution whose support overlaps with, but is not identical to, that of the experimental sample.
In our setup, we fix the dimension of the covariates ($p$) to be 10, the experimental sample size ($n$) to be 300, and the observational sample size ($m$) to be 1000. We are particularly interested in the treatment effect on the primary outcome $Y$.
To evaluate different methods, we compare their performance based on the mean squared error (MSE). To calculate MSE, we use the Monte Carlo method to estimate the true value of the average treatment effect (ATE) and generate 200 realizations. This approach allows us to robustly assess the accuracy and reliability of the various methods under different conditions.
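The evaluation loop can be summarized by the generic sketch below; `simulate_samples`, the estimator functions, and the way the true ATE is obtained are placeholders standing in for the data-generating mechanism and the methods compared above.

```python
# Generic Monte Carlo MSE harness (placeholders for the DGP and estimators).
import numpy as np

def mse_comparison(simulate_samples, estimators, true_ate, n_reps=200, seed=0):
    """estimators: dict mapping a method name to a function (obs, exp) -> ATE estimate."""
    rng = np.random.default_rng(seed)
    errors = {name: [] for name in estimators}
    for _ in range(n_reps):
        obs, exp = simulate_samples(rng)              # one realization of both samples
        for name, estimator in estimators.items():
            errors[name].append(estimator(obs, exp) - true_ate)
    # Mean squared error over the realizations for each method.
    return {name: float(np.mean(np.square(errs))) for name, errs in errors.items()}
```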
Table 1 and Table 2 present the simulation results. We observe that when the support of the pre-treatment covariates is identical, the generalized random forest (GRF) method outperforms the other two methods, regardless of the presence of confounding. This outcome is expected, as identical support implies no need for extrapolation, rendering the improvements from the Kallus method minimal. Conversely, when the support of the pre-treatment covariates differs, both the Kallus and Kallus IV methods demonstrate competitive performance. Notably, in the presence of confounding, the Kallus IV method surpasses the GRF method in terms of performance.
To further explore the scenario of differing supports, we modify the previous setting slightly. We now assume that when the support of the pre-treatment covariates is not identical, the support of the pre-treatment covariates in the experimental sample is contained within the support of the pre-treatment covariates in the observational sample, rather than merely overlapping. Specifically, we have the following:
- 5a. Identical Support: when true, we assume the distribution of the covariates in the experimental sample and that in the observational sample are the same (uniform); when false, the covariates in the experimental sample are uniform on a strict subset of the region over which the covariates in the observational sample are uniform.
Table 3 shows the simulation results. We see that, similar to the results in the previous two tables, Kallus/Kallus IV performs better than GRF when the supports differ.
6. A Real Data Example
In this section, we investigate the performance of our procedure on a real dataset. Having provided several applications in Section 4 and simulation studies in the previous section, we now use a real-world example to demonstrate the robustness of our procedure on real data. We utilize the famous Tennessee STAR study (Achilles et al. 2008). The Tennessee Student/Teacher Achievement Ratio (STAR) study was a large-scale, longitudinal educational experiment conducted in the late 1980s to examine the effects of class size on student performance. In this study, over 7000 students from kindergarten to third grade across 79 schools were randomly assigned to one of three types of classrooms: small classes (13–17 students), regular-sized classes (22–25 students), or regular-sized classes with a teacher's aide. The goal of the study was to assess whether smaller class sizes would lead to improved academic outcomes, such as higher test scores and long-term achievement. This dataset is also used in Kallus et al. (2018) and Athey et al. (2020); we use it in a different manner. Specifically, we select the following covariates for each student: gender, race, birth month, birth day, birth year, free lunch status, teacher ID, and student home location. We focus on two outcomes: the average grade in year 1 and the average grade in year 3. We remove all records with missing outcome variables. In this study, the treatment indicates whether the student is in a small class (treatment) or a regular class (control). After cleaning the data, we have a dataset with 2498 units, 9 covariates, 1 treatment variable, and 2 outcome variables. We use the method in Athey et al. (2020) to generate a large population, which we view as the ground truth; we call this the ground-truth dataset. To assess the different methods, we perform the following:
1. Use the ground-truth dataset to calculate the average treatment effect on the average grade in year 3. This estimate is viewed as the ground truth.
2. Repeat the following steps 500 times:
   - Sample rural or inner-city students, keeping the treatment variable, the average grade in year 1, and all the covariates except the student location covariate. This is our experimental sample.
   - Sample rural or inner-city students in the control group that are not sampled in the experimental sample; sample rural or inner-city students in the treatment group whose year-1 average grade is in the lower half among treated rural or inner-city students; sample urban or suburban students in the control group; and finally, sample urban or suburban students in the treatment group whose year-1 average grade is in the lower half among treated urban or suburban students. This is our observational sample, which is confounded because we selectively remove students with higher scores from the population.
   - Use the different methods to estimate the average treatment effect based on the experimental and observational samples.
3. Compare the methods based on mean squared error (MSE).
We only compare the GRF and imputation estimators, as the Kallus method involves estimating the coefficients of a linear function of the covariates, but we only have categorical covariates. We also include the mean squared error of the AIPW estimator computed on the observational sample (note that the AIPW estimator requires the sample to be unconfounded).
Table 4 gives the results. We see that, in general, the GRF estimator outperforms the imputation estimator, and these two estimators both outperform the AIPW estimator significantly. In particular, as Table 5 shows, the empirical mean of the AIPW estimates is actually negative (while the true treatment effect is positive) and is far from the true treatment effect.
7. Conclusions
In this paper, we proposed a straightforward procedure to estimate the average treatment effect (ATE) of the primary outcome in an observational study by leveraging an experimental study for the surrogate outcome. We demonstrated that our procedure is applicable in various settings, provided that the conditional average treatment effect (CATE) of the surrogate outcome can be accurately estimated. Through a series of simulations, we compared several methods and showed that our procedure produces a more precise estimate, in terms of mean squared error (MSE), than the canonical imputation estimator proposed by Athey et al. (2020).
Our method’s robustness was examined across different scenarios, including settings with and without confounding, as well as cases with identical and varying supports of pre-treatment covariates between experimental and observational samples. Furthermore, in our simulation study, we extended our discussions to scenarios where the support of pre-treatment covariates in the experimental sample is contained within the support of pre-treatment covariates in the observational sample. This setting provided additional insights into the estimators’ performance under more structured support conditions, further demonstrating the effectiveness of our proposed procedure.