1. Introduction
Estimation of causal parameters such as the average treatment effect (ATE) in observational data requires confounder adjustment. The estimation and inference are carried out in two steps: in step 1, the treatment and the outcome are predicted by a statistical model or machine learning (ML) algorithm, and in step 2 the predictions are inserted into the causal effect estimator. The relationships between the confounders and the treatment and outcome can be non-linear, which makes the application of ML algorithms, as non-parametric models, appealing in the first step. Farrell et al. [1] proposed using two separate neural networks (double NNs or dNNs) with no regularization on the networks' parameters other than that implied by the stochastic gradient descent (SGD) used in the NN optimization [2,3,4,5]. They derive generalization bounds and prove that the NN algorithms converge fast enough that causal estimators such as the augmented inverse probability weighting (AIPW) estimator [6,7,8] are asymptotically linear, under regularity conditions and with the use of cross-fitting [9].
Farrell et al. [1] argue that the fact that SGD-type algorithms control the complexity of the NN algorithm to some extent [2,10] is sufficient for the first step. Our initial simulations and analyses, however, contradict this claim in scenarios where strong confounders and instrumental variables (IVs) exist in the data.
Conditioning on IVs is harmful to the performance of causal effect estimators such as those of the ATE (Myers et al. [11]), but there may be no prior knowledge about which covariates are IVs, confounders or otherwise. The harm comes from the fact that complex NNs can provide near-perfect prediction of the treatment, which violates the empirical positivity assumption [12].
The positivity assumption (Section 2) is fundamental for the causal parameter to be identifiable in a population. However, in a finite sample, even though the parameter is identifiable under the positivity assumption, the bias and variance of the estimator can be inflated if the estimated propensity scores are close to the zero or one bounds (or become exactly zero or one through rounding errors); for example, a control observation with an estimated propensity score of 0.999 receives an inverse probability weight of 1/(1 − 0.999) = 1000, and such observations can dominate the estimator. This requirement is referred to as the empirical positivity assumption and is closely related to the concept of sparsity studied in Chapter 10 of van der Laan and Rose [8]. Violation of the empirical positivity assumption can inflate the bias and variance of inverse probability weighting (IPW)-type and AIPW-type estimators.
The inverse probability weighting method dates back at least to Horvitz and Thompson [13] in the literature on sampling with unequal selection probabilities in sub-populations. IPW-type and matching methods have been extensively studied by Lunceford and Davidian [7], Rubin [14], Rosenbaum and Rubin [15,16] and Busso et al. [17]. IPW is proven to be a consistent estimator of ATE if the propensity scores (the conditional probabilities of treatment assignment) are estimated by a consistent parametric or non-parametric model. Another set of ATE estimators consists of those that model the outcome and insert the predictions directly into the ATE estimator (Section 2). They are referred to as single robust (SR) estimators as they provide consistent estimators of ATE if the outcome model is consistent. In this sense, IPW is also single robust, as it is consistent if the treatment (or propensity score) model is consistent. The focus of this work is on augmented IPW-type methods, which involve modeling both the treatment and the outcome and are consistent estimators of ATE if either of the two models is consistent.
We propose and study a simple potential remedy to the empirical positivity violation issue by studying a normalization of the AIPW estimator (similar to the normalization of IPW [7]), here referred to as nAIPW. In fact, both AIPW and nAIPW can be viewed as special cases of a more general estimator which is derived via the efficient influence function of ATE [18,19].
A general framework of estimators that includes nAIPW as a special case was proposed by [20]. In that work, the authors did not consider machine learning algorithms for the first-step estimation, but rather assumed parametric statistical models estimated by likelihood-based approaches, and they focused on how to consistently estimate ATE within different sub-populations defined by the covariates. There is a lack of numerical experimentation on these estimators, especially when IVs and strong confounders exist among the candidate covariates.
To the best of our knowledge, the performance of nAIPW has not been previously studied in the machine learning context under the assumption that strong confounders and IVs exist in the data. We will prove that this estimator has the doubly robust [6] and rate doubly robust [19] properties, and illustrate that it is robust against extreme propensity score values. Further, nAIPW (similar to AIPW) has the orthogonality property [9], which means that it is robust against small variations in the predictions of the outcome and the treatment assignment. One theoretical difference is that AIPW is the most efficient estimator among all doubly robust estimators of ATE when both the treatment and outcome models are correctly specified [21]. In practice, however, there is often no a priori knowledge about the true relationships of the outcome and the propensity score with the input covariates, and thus this feature of AIPW is probably of less practical use.
We argue that for causal parameter estimation, a dNN with no regularization may lead to high variance of the causal estimator used in the second step. We compare AIPW and nAIPW through a simulation study in which we allow for moderate to strong confounding and instrumental variable effects, that is, for possible violation of the empirical positivity assumption. Further, a comparison between AIPW and nAIPW is made on the Canadian Community Health Survey (CCHS) dataset, where the treatment is food insecurity (vs. food security) and the outcome is individuals' body mass index (BMI).
Our contributions include presenting proofs of the orthogonality, double robustness and rate double robustness of nAIPW. Further, we prove that, under certain assumptions, nAIPW is asymptotically normal, and we provide a consistent estimator of its variance. We analyze the estimation of ATE in the presence of not only confounders, but also IVs, y-predictors and noise variables. We demonstrate that, in the presence of strong confounders and IVs, if complex neural networks without regularization are used in the step 1 estimation, both the AIPW and nAIPW estimators and their asymptotic variances perform poorly, but, relatively speaking, nAIPW performs better. In this paper, the NNs are used mainly as a means of estimating the outcome and treatment predictions.
Organization of the article is as follows. In Section 2 we formally introduce the nAIPW estimator and state its double robustness property, and in Section 3 we present the first-step prediction model, double neural networks. In Section 4 and Section 5 we present the theoretical aspects of the paper, including the double robustness, rate double robustness and orthogonality of the proposed estimator (nAIPW) and its asymptotic normality. We present the simulation scenarios and the results of comparing the nAIPW estimator with other conventional estimators in Section 6. We apply the estimators to a real dataset in Section 7. The article concludes with a short discussion of the findings in Section 8. The proofs are straightforward but long and are therefore included in Appendix A.
2. Normalized Doubly Robust Estimator
Let the data be generated by a data generating process $P$, where each observation is a finite-dimensional vector $O = (W, A, Y)$, with $W$ being the adjusting factors, $A$ the treatment indicator and $Y$ the outcome. $P$ is the true observed data distribution, and $P_n$ is a distribution of $O$ whose marginal distribution with respect to $W$ is the empirical distribution and for which the expectation of the conditional distribution of $Y$ given $A = a$ and $W$, for $a \in \{0, 1\}$, can be estimated. We denote the prediction function of the observed outcome given the explanatory variables in the treated group by $Q_1(W) = E(Y \mid A = 1, W)$, that in the untreated group by $Q_0(W) = E(Y \mid A = 0, W)$, and the propensity score by $g(W) = P(A = 1 \mid W)$. Throughout, expectations are taken with respect to $P$. A hat on a population-level quantity indicates the corresponding finite-sample estimator, with $P$ replaced by $P_n$.
Let the causal parameter of interest be the average treatment effect (ATE), $E(Y^1 - Y^0)$, where $Y^1$ and $Y^0$ are the potential outcomes under treatment and under control, respectively [6].
For identifiability of the parameter, the following assumptions must hold. The first is conditional independence, or unconfoundedness, stating that, given the confounders, the potential outcomes are independent of the treatment assignment ($(Y^1, Y^0) \perp A \mid W$). The second is positivity, which entails that the assignment to treatment groups is not deterministic ($0 < P(A = 1 \mid W) < 1$). The third is consistency, which states that the observed outcome equals the corresponding potential outcome ($Y = A Y^1 + (1-A) Y^0$). Other modeling assumptions are also made, such as time order (i.e., the covariates $W$ are measured before the treatment), independent and identically distributed subjects, and a linear causal effect.
A list of first candidates to estimate the ATE includes the naive ATE, the single robust (SR) estimator, the inverse probability weighting (IPW) estimator and the normalized IPW (nIPW) estimator.
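For concreteness, the standard forms of these estimators, written in the notation above (with $\hat{\beta}$ denoting a generic estimator of the ATE), are
\[
\hat{\beta}_{\mathrm{naive}} = \frac{\sum_{i} A_i Y_i}{\sum_{i} A_i} - \frac{\sum_{i} (1-A_i) Y_i}{\sum_{i} (1-A_i)}, \qquad
\hat{\beta}_{\mathrm{SR}} = \frac{1}{n}\sum_{i=1}^{n} \left[\hat{Q}_1(W_i) - \hat{Q}_0(W_i)\right],
\]
\[
\hat{\beta}_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} \left[\frac{A_i Y_i}{\hat{g}(W_i)} - \frac{(1-A_i) Y_i}{1-\hat{g}(W_i)}\right], \qquad
\hat{\beta}_{\mathrm{nIPW}} = \frac{\sum_{i} A_i Y_i/\hat{g}(W_i)}{\sum_{i} A_i/\hat{g}(W_i)} - \frac{\sum_{i} (1-A_i) Y_i/(1-\hat{g}(W_i))}{\sum_{i} (1-A_i)/(1-\hat{g}(W_i))}.
\]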
The naive average treatment effect (naive ATE) is a biased estimator of ATE (due to selection bias) [22] and is the poorest estimator among all the candidates. The single robust (SR) estimator is not an orthogonal estimator [9], and if ML algorithms that do not belong to the Donsker class ([23], Section 19.2) or whose entropy grows with the sample size are used, this estimator becomes biased and is not asymptotically normal. The inverse probability weighting (IPW) estimator [13] and its normalized version (nIPW) adjust (or weight) the observations in the treatment and control groups. IPW and nIPW are also not orthogonal estimators and are similar to SR in this respect. In addition, SR, IPW and nIPW are single robust, that is, they are consistent estimators of ATE if the models they use are consistent [7]. IPW is an unbiased estimator of ATE if g is correctly specified; nIPW is not unbiased, but it is less sensitive to extreme predictions. The augmented inverse probability weighting (AIPW) estimator [21] is an improvement over SR, IPW and nIPW: it involves the predictions of both the treatment (the propensity score) and the outcome, and the causal parameter can be expressed as in (3), whose sample version gives the AIPW estimator (4), with the fitted values of the outcome and propensity score models evaluated at each observation.
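For concreteness, in the notation above the sample AIPW estimator takes the standard form
\[
\hat{\beta}_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i\{Y_i - \hat{Q}_1(W_i)\}}{\hat{g}(W_i)} - \frac{(1-A_i)\{Y_i - \hat{Q}_0(W_i)\}}{1-\hat{g}(W_i)} + \hat{Q}_1(W_i) - \hat{Q}_0(W_i)\right].
\]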
Among all the doubly robust estimators of ATE, AIPW is the most efficient if both the propensity score and outcome models are correctly specified, but it is not necessarily efficient under incorrect model specification. In fact, this nice feature of AIPW may be less relevant in real-life problems, as we might not have a priori knowledge about the predictors of the propensity score and the outcome and thus cannot correctly model them. Further, in practice, perfect or near-perfect prediction of the treatment assignment can inflate the variance of the AIPW estimator [8]. As a remedy, similar to the normalization of the IPW estimator, we can define a normalized version of the AIPW estimator which is less sensitive to extreme values of the predicted propensity score, referred to as the normalized augmented inverse probability weighting (nAIPW) estimator (5), where the normalizing weights rescale the inverse probability weights within the treated and control groups, respectively. Both the AIPW and nAIPW estimators add to the SR estimator adjustment factors that involve both the treatment and the outcome models.
Both AIPW and nAIPW are examples of a class of estimators indexed by a pair of normalizing weights; we refer to this general class as the general doubly robust (GDR) estimator. One choice of the weights gives the AIPW estimator and another choice gives the nAIPW estimator.
The GDR estimator can also be written in an equivalent form in which the augmentation terms appear explicitly. If the normalizing weights are chosen so that the augmentation terms have expectation zero, then, by the law of total expectation, the GDR estimator is an unbiased estimator of the ATE.
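As a concrete illustration, the following sketch (in NumPy, with variable names of our own choosing) computes the AIPW and nAIPW point estimates from first-step predictions; the nAIPW weights follow the self-normalization described above, and the expressions should be checked against (4) and (5).

```python
import numpy as np

def aipw_naipw(A, Y, Q1, Q0, g, eps=1e-12):
    """AIPW and nAIPW point estimates from predictions Q1 = E[Y|A=1,W],
    Q0 = E[Y|A=0,W] and propensity scores g = P(A=1|W)."""
    n = len(Y)
    g = np.clip(g, eps, 1.0 - eps)              # tiny guard against exact 0/1 from rounding only
    r1 = A * (Y - Q1) / g                        # treated residual term
    r0 = (1 - A) * (Y - Q0) / (1 - g)            # control residual term
    aipw = np.mean(r1 - r0 + Q1 - Q0)
    # nAIPW: self-normalize the inverse probability weights within each arm
    w1 = n / np.sum(A / g)
    w0 = n / np.sum((1 - A) / (1 - g))
    naipw = np.mean(w1 * r1 - w0 * r0 + Q1 - Q0)
    return aipw, naipw
```

Here A, Y, Q1, Q0 and g are NumPy arrays of length n; the predictions would typically come from the cross-fitted first-step models described in the following sections.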
3. Outcome and Treatment Predictions
The causal estimation and inference, when utilizing AIPW and nAIPW, are carried out in two steps. In step 1, the treatment and the outcome are predicted by a statistical or machine learning (ML) algorithm, and in step 2 the predictions are inserted into the estimator. The ML algorithms in step 1 can capture both linear and non-linear relationships between the confounders and the treatment and the outcome.
Neural networks (NNs) [2,3,4] are a class of non-linear, non-parametric and complex algorithms that can be employed to model the relationship between any set of inputs and an outcome. There has been a tendency to use NNs as they have achieved great success in the most complex artificial intelligence (AI) tasks, such as computer vision and natural language understanding [2].
Farrell et al. [1] used two independent NNs to model the propensity score and the outcome, with the rectified linear unit (ReLU) activation function [2], here referred to as the double NN or dNN, where two separate neural nets model y and A (no parameter sharing). Farrell et al. [1] proved that dNN algorithms almost attain the convergence rates required for second-step inference. By employing the cross-fitting method and theory developed by Chernozhukov et al. [9], an orthogonal causal estimator is asymptotically normal, under some regularity and smoothness conditions, if the dNN is used in the first step (see Theorem 1 in [1]).
These results assume that no regularization is imposed on the NNs' weights and that only stochastic gradient descent (SGD) is used. Farrell et al. claim that the fact that SGD controls the complexity of the NN algorithm to some extent [2,10] is sufficient for the first step. Our initial simulations, however, contradict this claim, and we hypothesize that for causal parameter estimation a dNN with no regularization leads to high variance of the causal estimator used in the second step. Our initial experiments also indicate that dropout does not perform well in terms of the mean square error (MSE) of AIPW. The loss functions we use therefore contain an $L_1$ regularization term (in addition to the SGD used during optimization), where the tuning constants are hyperparameters that can be set before training or determined by cross-validation, and that can cause the training to pay more attention to one part of the output layer. The dNN can have an arbitrary number of hidden layers, and the width of the network can be treated as another hyperparameter. For a three-layer network, the architecture is determined by the number of neurons in each hidden layer. The connection parameters in the non-linear part of the networks are shared between the outcome and propensity models. Note that the gradient descent-type optimizations in deep learning platforms (such as PyTorch in our case) do not cause the NN parameters to shrink to zero.
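As an illustration only, the following PyTorch sketch implements one possible version of such a network: a shared ReLU trunk with separate output heads for the outcome and the treatment, and an $L_1$ penalty added to the combined loss. The head structure, widths and hyperparameter names here are our own assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class SharedDNN(nn.Module):
    """Shared ReLU trunk with an outcome head and a propensity (treatment) head."""
    def __init__(self, p, width):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(p, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.q_head = nn.Linear(width + 1, 1)   # outcome model Q(A, W): treatment enters as an extra input
        self.g_head = nn.Linear(width, 1)       # propensity model g(W)

    def forward(self, w, a):
        # w: (n, p) float tensor of covariates; a: (n,) float tensor of 0/1 treatments
        h = self.trunk(w)
        q = self.q_head(torch.cat([h, a.unsqueeze(1)], dim=1)).squeeze(1)
        g = torch.sigmoid(self.g_head(h)).squeeze(1)
        return q, g

def joint_loss(model, w, a, y, alpha1=1.0, alpha2=1.0, lam=0.01):
    """Weighted sum of outcome MSE and treatment cross-entropy plus an L1 penalty."""
    q, g = model(w, a)
    mse = torch.mean((y - q) ** 2)
    bce = nn.functional.binary_cross_entropy(g, a)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return alpha1 * mse + alpha2 * bce + lam * l1
```

Training can then proceed with, for example, torch.optim.Adam over mini-batches, and the outcome predictions needed by the estimators are obtained by evaluating the fitted outcome head at a = 1 and a = 0 for every observation.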
5. Asymptotic Sampling Distribution of nAIPW
Replacing g in the denominator of the von Mises expansion (12) with the normalizing terms is enough to obtain the asymptotic distribution of nAIPW and its asymptotic standard error. However, nAIPW can also be seen as the solution to a set of (extended) estimating equations. This is important because van der Vaart (Chapters 19 and 25) proves that, under certain regularity conditions, if the prediction models belong to the Donsker class, Z-estimators (solutions to estimating equations) are consistent and asymptotically normal ([23], Theorem 19.26). Thus, nAIPW, being the solution to a Z-estimating equation (also referred to as an M-estimator), inherits consistency and asymptotic normality, assuming certain regularity conditions hold and the first-step prediction models belong to the Donsker class.
The Donsker class assumption rules out overly complex algorithms in the first step, such as tree-based models, NNs, cross-hybrid algorithms or their aggregations [19,27]. The Donsker class assumption can be relaxed if sample splitting (or cross-fitting) is utilized and the target parameter is orthogonal [9]. In the next section we show that nAIPW is orthogonal and thus, theoretically, the Donsker class assumption can be relaxed under certain smoothness regularity conditions. Before presenting the orthogonality property of nAIPW, we review the regularity conditions necessary for asymptotic normality. Let the causal parameter and the infinite-dimensional nuisance parameters be as defined above, with the nuisance parameters lying in a convex set T equipped with a norm. Additionally, let the score function be a measurable function of the observed data O, the causal parameter and the nuisance parameters, where O has probability distribution P, and let the parameter space be an open set containing the true causal parameter. Let a sample of size n be observed, and let the set of probability measures be allowed to expand with the sample size n. In addition, let the estimator be the solution to the empirical estimating equation defined by the score function. The assumptions that guarantee that the second-step orthogonal estimator is asymptotically normal are [9]: (1) the true causal parameter does not fall on the boundary of the parameter space; (2) the map defined by the expected score is twice Gateaux differentiable (this holds by the positivity assumption) and the causal parameter is identifiable; (3) this map is smooth enough; (4) the estimated nuisance parameters fall within a neighborhood of the true nuisance parameters with high probability and converge to them at least as fast as required by the rate conditions (similar to, but slightly stronger than, the first two assumptions in (17)); (5) the score function has a finite second moment for all values of the causal parameter and all nuisance parameters; (6) the score function is measurable; and (7) the number of cross-fitting folds increases with the sample size.
5.1. Orthogonality and the Regularity Conditions
The orthogonality condition [9] is a property of the score functions defining the estimating Equation (27). We refer to an estimator obtained from the estimating Equation (27) as an orthogonal estimator.
Let the nuisance parameters lie in a convex set T equipped with a norm. Additionally, let the score functions be measurable functions of the observed data O, where O has probability distribution P, and let the parameter space be an open set containing the true causal parameter. Let a sample of size n be observed, and let the set of probability measures be allowed to expand with the sample size n. The score function satisfies the Neyman orthogonality condition with respect to the nuisance parameters if its Gateaux derivative with respect to the nuisance parameters exists in all directions within T and vanishes when evaluated at the true parameter values, as formalized in (28).
Chernozhukov et al. [24] present a few examples of orthogonal estimating equations, including the AIPW estimator (4). Utilizing cross-fitting, under standard regularity conditions, the asymptotic normality of estimators with orthogonal estimating equations is guaranteed even if the nuisance parameters are estimated by ML algorithms that do not belong to the Donsker class, and without finite entropy conditions [24]. The regularity conditions to be satisfied are: (1) the true causal parameter does not fall on the boundary of the parameter space; (2) the map defined by the expected score is twice Gateaux differentiable and the causal parameter is identifiable; (3) this map is smooth enough; (4) the estimated nuisance parameters fall within a neighborhood of the true nuisance parameters with high probability and converge to them fast enough; (5) the score function has a finite second moment for all values of the causal parameter and all nuisance parameters; (6) the score function is measurable; and (7) the number of cross-fitting folds increases with the sample size.
By replacing the two additional (normalizing) parameters in the first line of (26) with their solutions from the second and third equations, the nAIPW score function is obtained.
Implementing the orthogonality condition (28), it can be verified that nAIPW (5) is also an example of an orthogonal estimator. To see this, we apply the definition of orthogonality [9]: the Gateaux derivative of the expected nAIPW score with respect to the nuisance parameters can be written as a sum of terms, each consisting of a conditional expectation of an outcome or treatment residual multiplied by some functions a and b. The last equality holds because each of these conditional expectations equals zero, under correct specification of the propensity score g.
Thus, nAIPW is orthogonal, and by utilizing cross-fitting for the estimation, nAIPW is consistent and asymptotically normal under certain regularity conditions.
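For completeness, the following is a minimal sketch of the K-fold cross-fitting scheme referred to above, in which each observation's nuisance predictions come from models fitted on the other folds. The fit_outcome and fit_treatment arguments are hypothetical model-fitting callables (returning prediction functions) and are not part of the paper.

```python
import numpy as np

def cross_fit(W, A, Y, fit_outcome, fit_treatment, K=5, seed=0):
    """K-fold cross-fitting: nuisance predictions for each observation are made
    by models trained on the other folds, as required for orthogonal estimators."""
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % K
    Q1, Q0, g = np.empty(n), np.empty(n), np.empty(n)
    for k in range(K):
        test, train = folds == k, folds != k
        # outcome models fitted within the treated / untreated training subsets
        q1_model = fit_outcome(W[train & (A == 1)], Y[train & (A == 1)])
        q0_model = fit_outcome(W[train & (A == 0)], Y[train & (A == 0)])
        g_model = fit_treatment(W[train], A[train])
        Q1[test], Q0[test], g[test] = q1_model(W[test]), q0_model(W[test]), g_model(W[test])
    return Q1, Q0, g
```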
5.2. Asymptotic Variance of nAIPW
To evaluate the asymptotic variance of nAIPW, we employ M-estimation theory [23,28]. For causal inference with M-estimators, the bootstrap is not in general a valid method for estimating the variance of the causal estimator, even if the nuisance parameter estimators converge at fast rates. However, sub-sampling m out of n observations [29] can be shown to be universally valid, provided m grows with the sample size while remaining of smaller order than n. In practice, however, we can face computational issues, since the nuisance parameters must be estimated separately (possibly with ML models) for each subsample/bootstrap sample.
The variance estimator of AIPW (4) is given in (31) [7]. The theorem below states that the variance estimator of AIPW (31) extends intuitively to a variance estimator of nAIPW (5) by moving the overall denominator inside the squared term in the summation and replacing it with the corresponding normalizing sums in the terms containing g and 1 − g in the denominator, respectively.
Theorem 2. The asymptotic variance of the nAIPW estimator (5) is given by (32), with the normalizing weights defined as in (5). The proof, using the estimating equation technique, is straightforward and is left to Appendix A. The same result is obtained when deriving the estimator with the one-step method (see (12) and (14)). Since nAIPW is orthogonal, this variance estimator is consistent by the theory of [1,9], provided the assumptions are met, cross-fitting is used, and the step 1 ML algorithms have the required convergence rates.
In other words, the variance estimator of nAIPW is obtained from that of AIPW (31) by moving the overall denominator inside the squared term in the summation and replacing it with the corresponding normalizing sums in the terms containing g and 1 − g in the denominator, respectively. This is intuitive because, by the law of total probability, the expectation of the normalizing sum appearing in each of the first two terms is n.
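As a rough numerical companion to Theorem 2, the sketch below computes plug-in standard errors from the empirical second moment of the estimated influence function, with the inverse probability terms self-normalized for nAIPW; this follows the general form of (31) and (32) and is intended as an illustration rather than the exact formulas.

```python
import numpy as np

def plug_in_se(A, Y, Q1, Q0, g, estimate, normalized=False):
    """Influence-function plug-in standard error for AIPW (normalized=False)
    or nAIPW (normalized=True), given the corresponding point estimate."""
    n = len(Y)
    w1 = n / np.sum(A / g) if normalized else 1.0
    w0 = n / np.sum((1 - A) / (1 - g)) if normalized else 1.0
    infl = (w1 * A * (Y - Q1) / g
            - w0 * (1 - A) * (Y - Q0) / (1 - g)
            + Q1 - Q0 - estimate)
    return np.sqrt(np.sum(infl ** 2)) / n   # equals sqrt(mean(infl^2) / n)
```

A 95% asymptotic confidence interval is then estimate ± 1.96 times the returned standard error.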
6. Monte Carlo Experiments
A Monte Carlo simulation study (with 100 iterations) was performed to compare the AIPW and nAIPW estimators, where the dNN is used for the first-step predictions. There are a total of two case scenarios according to the size of the data: a smaller and a larger scenario, each with a fixed sample size and a corresponding fixed number of covariates p. The predictors include four types of covariates: confounders, instrumental variables, outcome predictors (y-predictors), and noise or irrelevant covariates. Their numbers are fixed within each scenario, and the covariates are independent of each other and drawn from a multivariate normal (MVN) distribution. The models used to generate the treatment assignment and the outcome are built from functions that select 20% of the columns and apply the interactions and non-linear functions listed below in (35). The strengths of the instrumental variable and confounding effects were controlled by coefficients multiplying the corresponding terms in the treatment and outcome models, chosen to range from moderate to strong.
The non-linearities are randomly selected from among the set of interaction and non-linear transformation functions listed in (35).
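To make the design concrete, the following is a minimal, purely illustrative sketch of such a data generating process; the covariate-group sizes, coefficients and non-linearities here are arbitrary placeholders rather than the values used in the simulation study.

```python
import numpy as np

def simulate(n=1000, n_conf=4, n_iv=4, n_ypred=4, n_noise=4, gamma=1.0, seed=0):
    """Illustrative DGP with confounders, instruments, y-predictors and noise covariates."""
    rng = np.random.default_rng(seed)
    p = n_conf + n_iv + n_ypred + n_noise
    W = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
    Xc = W[:, :n_conf]                                    # confounders
    Xiv = W[:, n_conf:n_conf + n_iv]                      # instrumental variables
    Xy = W[:, n_conf + n_iv:n_conf + n_iv + n_ypred]      # outcome predictors
    # treatment depends on confounders and instruments; outcome on confounders and y-predictors
    logit_g = gamma * (Xc.sum(axis=1) + Xiv.sum(axis=1))
    A = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_g)))
    beta = 1.0                                            # true ATE in this toy setup
    Y = beta * A + gamma * Xc.sum(axis=1) + Xy.sum(axis=1) + rng.normal(size=n)
    return W, A, Y, beta
```

Increasing gamma strengthens both the confounding and the instrumental variable effects, pushing the propensity scores toward 0 and 1 and thereby stressing the empirical positivity assumption.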
The networks' activation function is the rectified linear unit (ReLU), with 3 hidden layers as wide as the input size (p), $L_1$ regularization, a fixed batch size and 200 epochs. The adaptive moment estimation (Adam) optimizer [30] with learning rate 0.01 and momentum 0.95 was used to estimate the networks' parameters, including the causal parameter (ATE).
Simulation Results
The oracle estimates are plotted in all the graphs to compare the realistic situations with the truth. In almost all the scenarios we cannot obtain perfect causal effect estimation and inference. Figure 1 shows the distribution of AIPW and nAIPW for different hyperparameter settings of the NNs. The nAIPW estimator outperforms AIPW in almost all the scenarios. As AIPW gives huge values in some simulation iterations, the log of the estimates is plotted in Figure 1.
We also compare the estimators across scenarios using bias, variance and bias-variance tradeoff measures: the Monte Carlo bias, the Monte Carlo standard deviation and the root mean square error (RMSE), computed over the AIPW or nAIPW estimates from the simulation rounds, with the estimated asymptotic standard error taken as the square root of (31) or (32).
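Assuming the usual definitions of these Monte Carlo summaries, they can be computed as follows.

```python
import numpy as np

def mc_metrics(estimates, truth):
    """Monte Carlo bias, standard deviation and RMSE over simulation rounds."""
    estimates = np.asarray(estimates)
    bias = np.mean(estimates) - truth
    std = np.std(estimates, ddof=1)
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    return bias, std, rmse
```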
Figure 2 demonstrates the bias, MC standard deviation (MC std) and root mean square error (RMSE) of the AIPW and nAIPW estimators for the smaller and larger data size scenarios, and for four hyperparameter sets ($L_1$ regularization strength and width of the dNN). In general, in each panel of the figure, the hyperparameter scenarios on the left imply a more complex model (with less regularization or a narrower network). In these graphs, lower values indicate a better estimator. For the smaller data size, shown in the left three panels, the worst results are attributed to AIPW when there is the least regularization and the hidden layers are as wide as the number of inputs. To keep the plots readable for comparison, we omitted the upper confidence bounds, as they were very large; the lower bounds are enough to show the significance of the results. In the scenarios with smaller numbers of hidden neurons and 0.01 $L_1$ regularization, the bias, variance and their tradeoff (here measured by RMSE) are more stable. By increasing the $L_1$ regularization, these measures go down, which indicates the usefulness of regularization and of AIPW normalization for causal estimation and inference. Almost the same pattern is seen in the larger data size scenario, except for a bump in all three measures in the hyperparameter scenario where the regularization remains the same and the numbers of neurons in the first and last hidden layers are also small. In all three measures of bias, standard deviation and RMSE, nAIPW is superior to AIPW, or at least there is no statistically significant difference between AIPW and nAIPW.
We have noted that the results of the step 1 NN architecture without regularization are too unstable to be presented visually in the graphs. To avoid this, we allowed a span of values for the regularization strength, with the smallest value close to no regularization. If the results under the stronger regularization are better than those under the weakest, this is evidence that sufficient regularization must be imposed.
Figure 3 illustrates how the theoretical standard error formulas perform in the MC experiments, and how accurately they estimate the MC standard deviations. In these two graphs, smaller values do not necessarily imply superiority; in fact, the best results are achieved when the confidence intervals of the asymptotic SEs and the MC SDs intersect. In the left two scenarios, where the NN's complexity is high, the MC std and the SE are far from each other. Additionally, in the hyperparameter scenarios where both the width of the NNs is small and the regularization is higher, the MC std and SE are well separated. The scenario with the largest regularization and a wide NN architecture seems to be the best scenario. That said, none of the scenarios confirm the consistency of the SEs, which would likely also result in low coverage probability of the resulting confidence intervals.
7. Application: Food Insecurity and BMI
The Canadian Community Health Survey (CCHS) is a cross-sectional survey that collects data related to health status, health care utilization and health determinants for the Canadian population in multiple cycles. The 2021 CCHS covers the population 12 years of age and over living in the ten provinces and the three territorial capitals. Excluded from the survey's coverage are persons living on reserves and other Aboriginal settlements in the provinces, as well as some other sub-populations that altogether represent less than 3% of the Canadian population aged 12 and over. Examples of modules asked in most cycles are general health, chronic conditions, smoking and alcohol use. For the 2021 cycle, thematic content on food security, home care, sedentary behavior and depression, among many others, was included. In addition to the health component, the survey includes questions about respondent characteristics such as labor market activities, income and socio-demographics.
In this article, we use the CCHS dataset to investigate the causal relationship between food insecurity and body mass index (BMI). Other information gathered in the CCHS is also used, as it might contain potential confounders, y-predictors and instrumental variables. The data come from a survey and would normally require special methods, such as resampling or bootstrap methods, to estimate the standard errors; here, however, we use the data to illustrate the use of a dNN for estimating the causal parameters in the case of empirical positivity violation. In order to reduce the amount of variability in the data, we focus on the sub-population 18–65 years of age.
Figure 4 shows the ATE estimates and their 95% asymptotic confidence intervals for the nIPW, DR and nDR methods, with four different neural networks that vary in terms of width and strength of $L_1$ regularization. The scenario that results in the best outcome prediction performance outperforms the other scenarios. The scenario that results in the largest AUC (as a measure of treatment model performance) produces the largest confidence intervals, because of the more extreme propensity scores in this scenario. It is worth noting that the normalized estimator (nDR) has smaller confidence intervals than AIPW. However, as we do not know the true ATE in this dataset, we cannot know which estimator outperforms the other. To gain insight using the input matrix of these data, we simulated multiple treatments and outcomes with small to strong confounder and IV effects and compared AIPW and nAIPW; in virtually all of these simulations, nAIPW was the better estimator. We do not present these results in this paper, but they can be provided to readers upon request.
8. Discussion
Utilizing machine learning algorithms such as NNs in the first-step estimation process is comforting, as concerns about non-linear relationships between the confounders and the treatment and outcome are addressed. However, there is no free lunch, and using NNs has its own caveats, including theoretical as well as numerical challenges. Farrell et al. [1] addressed the theoretical concerns by deriving generalization bounds when two separate NNs are used to model the treatment and the outcome. However, they did not use or take into account regularization techniques such as $L_1$ or $L_2$ regularization. As NNs are complex algorithms, they can provide near-perfect predictions of the treatment when the predictors are strong enough (or they might overfit). Through Monte Carlo (MC) simulations, we illustrated that causal estimation and inference with double NNs can fail if regularization techniques such as $L_1$ regularization are not used and/or extreme propensity scores are not taken care of. If such regularization is not used, the normalized AIPW estimator (i.e., nAIPW) is advised, as it dilutes the extreme predictions of the propensity score model and provides better bias, variance and RMSE. Our scenario analysis also showed that, in the case of violation of the empirical positivity assumption, normalization helps avoid blowing up the AIPW estimator (and its standard error), but might be ineffective in taking into account confounding effects for some observations.
We note that the nAIPW estimator cannot perform better when the empirical positivity assumption is violated than when it is not. However, when empirical positivity is violated, nAIPW can perform better than AIPW; if it is not violated, our results indicate that AIPW outperforms nAIPW.
An alternative approach might be to trim the propensity scores to avoid extreme values. However, the resulting causal effect estimator is no longer consistent and there is no established rule for where to trim. We hypothesize that a suitably chosen data-adaptive modification of the weights will result in a consistent estimator, under the right assumptions, and will outperform both AIPW and nAIPW in the case of empirical positivity violation. We will study this hypothesis in a future article.
Another reason why NNs without regularization fail in causal estimation and inference is that the networks are not targeted: they are not directly designed for these tasks. NNs are complex algorithms with strong predictive power, which does not necessarily serve the purpose of causal parameter estimation, where the empirical positivity assumption can be violated if strong confounders and/or instrumental variables [22] exist in the data. Ideally, the network should target the confounders and should be able to automatically limit the strength of the predictors so that the propensity scores are not extremely close to 1 or 0. This was not investigated in this article, and a solution to this problem is postponed to another study.
In Section 7, we applied the asymptotic standard errors of both AIPW and nAIPW, where the latter achieved smaller standard errors. That said, we acknowledge that the asymptotic standard errors are not reliable when complex ML algorithms are used; in fact, they underestimate the calculated MC standard deviations, as illustrated in the simulations of Section 6. This is partly because of the use of complex algorithms such as NNs for the estimation of the nuisance parameters in the first step. Further, the sampling distributions of the estimators are not symmetric (and thus not normal); however, nAIPW is more symmetric than AIPW according to the simulations, while both estimators suffer from outliers. We will investigate the reasons and possible remedies for both the asymptotic distribution and the standard errors of the estimators in a future paper. The consistency of the variance estimator of nAIPW (and AIPW) relies on the assumptions being met. More investigation is needed on how to achieve consistent and asymptotically normal estimators of the ATE with a consistent variance estimator; potential avenues include proposing alternative estimators or improving the step 1 ML algorithms.