1. Introduction
The average treatment effect is widely used to measure causal effects [1]; however, it is not the only measure. For example, in a typical randomized Phase III clinical trial, there are patients who benefit in a negative trial and patients who do not benefit in a positive trial. In this situation, the average treatment effect cannot explain the causal effect completely, since it ignores the heterogeneous responses to the treatment in the target population. Some researchers have made additional assumptions to address this heterogeneity. One of the main assumptions is “monotonicity” [2], which assumes that the outcome of each individual under treatment is no worse than under control. There are many scientific and empirical reasons to doubt this assumption. Ref. [3] proposed several explanations for the fact that some people respond to an inactive control but do not respond to an experimental treatment, and noted that, for some people, a placebo has been shown to be superior to active treatment.
For this reason, we focus on the treatment benefit rate (TBR) and the treatment harm rate (THR) in this paper. Ref. [4] identified the TBR and THR under the additional assumption that the two potential outcomes are independent, conditional on observed covariates. Ref. [5] estimated the TBR and THR assuming the existence of at least three mutually independent covariates. Ref. [6] proposed a Bayesian-tree-based latent variable model to seek subpopulations with distinct TBR. Under the assumption that the potential outcomes are independent conditional on the observed covariates and an unmeasured latent variable, ref. [7] showed the identification of the TBR and THR in non-separable (generalized) linear mixed models for both continuous and binary outcomes. In this article, we follow the assumption in [4] to make the TBR and THR identifiable.
However, even in a randomized experiment, we still face a problem when identifying the TBR and THR: there may be missing data in the pretreatment covariate, the potential endpoint, or the treatment assignment. When one of the variables has missing data, the TBR and THR cannot be identified without further conditions; we can only obtain upper and lower bounds for the parameters of interest, rather than point estimates, and these bounds are too wide to be useful. When the missing data mechanism is ignorable, we can ignore observations with missing data and identify the TBR and THR directly. In many cases, however, the missing data are nonignorable, that is, the missingness depends on variables that may themselves be missing. In this case, the TBR and THR cannot be identified without other assumptions. For example, in a randomized clinical trial [8], the covariate is obtained from electrophysiological stimulation (EPS) testing. Because the EPS testing is invasive and not a prerequisite for enrollment in the study, 79.3% of patients in the implantable cardiac defibrillator arm have EPS records, whereas only 2.4% of patients in the control arm have EPS records. Therefore, the missing data problem for the covariate is very severe and nonignorable. Another example is the Awakening and Breathing Controlled trial [9]. In this trial, because some patients died, there are nonignorable missing data in the cognitive score at 3 months and 12 months. Of a total of 187 patients, 111 had a missing cognitive score at 3 months and 136 had a missing cognitive score at 12 months. Because of the missing data, we cannot identify the TBR and THR directly.
The problem of missing data in causal inference has been studied extensively in the literature. Refs. [8,10] used sensitivity analysis for the nonignorable missing covariate problem. Refs. [11,12,13] studied the identification problem when the missingness of the outcomes is nonignorable. Refs. [14,15,16] discussed the identifiability of causal effects when a key covariate is missing due to death. Refs. [17,18] also discussed nonignorable missing covariate problems in survival analysis and regression models. In this paper, we deal with the case in which one of the treatment, the covariate, and the endpoint has missing data. We give basic assumptions and special conditions on the pretreatment covariate, potential endpoints, and treatment under which we can identify the TBR and THR. These assumptions and conditions are applicable in a fairly wide range of settings.
The rest of this article is organized as follows. In Section 2, we introduce the notation and assumptions used throughout this article. In Section 3, we introduce several missing mechanisms for the covariate, endpoint, and treatment. In Section 4, we discuss the identifiability of the TBR and THR under these missing mechanisms. In Section 5, we estimate the TBR and THR using the EM algorithm in simulation studies when they can be identified. In Section 6, we analyze datasets from clinical trials with our methods. Lastly, the proofs of the theorems are given in Appendix A.
2. Notation and Assumption
Let Z denote the treatment assignment: $Z = 1$ means treatment, and $Z = 0$ means control. We assume that there is only one covariate, and let X denote the pretreatment covariate with K categories ($K \geq 2$). Suppose the K levels of X are $x_1, x_2, \ldots, x_K$. Let Y denote the endpoint, and suppose Y is binary: $Y = 1$ means that the treatment or control works, and $Y = 0$ means that the treatment or control does not work. Let $Y_0$ denote the potential endpoint under control and $Y_1$ the potential endpoint under intervention. Then, the observed endpoint Y can be written as $Y = ZY_1 + (1 - Z)Y_0$. In this article, we assume that one of X, Y, Z has missing data. Let $M_X$ denote the missing data indicator for X and $M_X(z)$ the potential missing data indicator for X under assignment z; $M_Y$ and $M_Y(z)$ are defined analogously for Y, and $M_Z$ denotes the missing data indicator for Z. Because Z is the treatment itself, we do not define a potential version of $M_Z$. For each unit, we can only observe the potential variables that correspond to the assigned treatment. $M_X = 1$ means X is missing, and $M_X = 0$ means X is observed. $M_Y = 1$ means Y is missing, and $M_Y = 0$ means Y is observed. $M_Z = 1$ means Z is missing, and $M_Z = 0$ means Z is observed. For X, Y, Z, we assume that only one of them has missing data, which means that one of $M_X$, $M_Y$, and $M_Z$ may equal 1, and the other two are constantly 0.
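Ref. [4] defines the TBR (treatment benefit rate) as the proportion of the relevant population that benefits from the intervention as compared with the control for a given endpoint. The THR (treatment harm rate) is defined as the proportion that is harmed by the intervention as compared with the control based on the same endpoint. Thus, writing $Y_0$ and $Y_1$ for the potential endpoints under control and intervention, with the value 1 meaning that the endpoint is favorable, we can describe the TBR and THR by the following equations:

$$\mathrm{TBR} = P(Y_0 = 0, Y_1 = 1), \qquad \mathrm{THR} = P(Y_0 = 1, Y_1 = 0).$$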
When TBR is much larger than THR, we can say that this treatment is beneficial. On the contrary, when THR is much larger than TBR, we can say that this treatment is harmful.
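The link between these two rates and the average treatment effect can be made explicit. The short derivation below is a standard identity for a binary endpoint, not a result of this paper; it writes $Y_0$ and $Y_1$ for the potential endpoints under control and intervention, which is notation assumed for this illustration.

```latex
\begin{aligned}
E(Y_1 - Y_0) &= P(Y_1 = 1) - P(Y_0 = 1) \\
&= \{P(Y_0 = 0, Y_1 = 1) + P(Y_0 = 1, Y_1 = 1)\}
 - \{P(Y_0 = 1, Y_1 = 0) + P(Y_0 = 1, Y_1 = 1)\} \\
&= \mathrm{TBR} - \mathrm{THR}.
\end{aligned}
```

The average treatment effect therefore only fixes the difference of the two rates: an average effect of zero is compatible with TBR = THR = 0 and equally with TBR = THR = 0.3.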
Let $A \perp B \mid C$ denote that variables A and B are conditionally independent, given variable C. To identify the TBR and THR, we need the following assumptions.
Assumption 1 (Complete randomization). The treatment assignment Z is independent of the pretreatment and potential variables; in particular, $Z \perp (X, Y_0, Y_1)$.
A completely randomized experiment satisfies this assumption: the treatment assignment Z is independent of the covariate and the potential endpoints. This assumption is very strong, and all the theory and methods discussed in this paper rely on it. Under this assumption, we obtain the following equation: $P(Y_z = y \mid X = x) = P(Y = y \mid Z = z, X = x)$ for $z = 0, 1$.
Assumption 2. $Y_0 \perp Y_1 \mid X$.
This assumption means that, given the pretreatment covariate X, the potential endpoints are independent of each other; that is, given the covariate X, $Y_0$ cannot predict $Y_1$ and $Y_1$ cannot predict $Y_0$.
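We aim to identify the TBR and THR from the observed data. If there are no missing data, under Assumptions 1 and 2, the TBR and THR can be written in terms of products of two conditional probabilities of the observed data. Let $\mathrm{TBR}(x)$ and $\mathrm{THR}(x)$ denote the treatment benefit rate and treatment harm rate, given $X = x$. Then, with $Y_0$ and $Y_1$ denoting the potential endpoints under control and intervention, we have:

$$\mathrm{TBR}(x) = P(Y_0 = 0, Y_1 = 1 \mid X = x) = P(Y = 0 \mid Z = 0, X = x)\, P(Y = 1 \mid Z = 1, X = x),$$

$$\mathrm{THR}(x) = P(Y_0 = 1, Y_1 = 0 \mid X = x) = P(Y = 1 \mid Z = 0, X = x)\, P(Y = 0 \mid Z = 1, X = x),$$

$$\mathrm{TBR} = \sum_{x} \mathrm{TBR}(x)\, P(X = x), \qquad \mathrm{THR} = \sum_{x} \mathrm{THR}(x)\, P(X = x).$$

The above equations show that the TBR and THR can be identified under Assumptions 1 and 2 when there are no missing data. However, when there are missing data in one of the covariate, endpoint, or treatment variables, Assumptions 1 and 2 are not enough to ensure the identification of the TBR and THR, and the above formulas no longer apply. In this paper, we give sufficient conditions to identify the TBR and THR when one of X, Y, Z has missing data.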
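As a minimal sketch of the complete-data case, the plug-in calculation can be written as follows. The function name `tbr_thr` and the counts are hypothetical and chosen only for illustration; the code assumes every (X, Z) cell is non-empty and that Y = 1 is the favorable value.

```python
from fractions import Fraction

def tbr_thr(counts):
    """Plug-in TBR and THR from complete-data counts.

    counts maps (x, y, z) -> number of units with X = x, Y = y, Z = z.
    Assumes every (x, z) combination has at least one unit.
    """
    n = sum(counts.values())
    levels = sorted({x for (x, _, _) in counts})
    tbr = thr = Fraction(0)
    for x in levels:
        n_x = sum(c for (xx, _, _), c in counts.items() if xx == x)
        # Estimate P(Y = 1 | Z = z, X = x) separately in each arm.
        p1 = {}
        for z in (0, 1):
            n_xz = counts.get((x, 0, z), 0) + counts.get((x, 1, z), 0)
            p1[z] = Fraction(counts.get((x, 1, z), 0), n_xz)
        weight = Fraction(n_x, n)  # estimate of P(X = x)
        # Benefit: fails under control and works under treatment.
        tbr += weight * (1 - p1[0]) * p1[1]
        # Harm: works under control and fails under treatment.
        thr += weight * p1[0] * (1 - p1[1])
    return tbr, thr

# Hypothetical counts, keyed by (x, y, z).
counts = {
    (0, 0, 0): 30, (0, 1, 0): 20, (0, 0, 1): 10, (0, 1, 1): 40,
    (1, 0, 0): 25, (1, 1, 0): 25, (1, 0, 1): 20, (1, 1, 1): 30,
}
tbr, thr = tbr_thr(counts)
print(tbr, thr)
```

Exact fractions are used so that the weighted sum over the levels of X can be checked by hand.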
Lastly, we introduce the following assumption.
Assumption 3. When X has missing data, . When Y has missing data, . When Z has missing data, .
We need this assumption to ensure that the missing variable is only partially missing.
3. Missing Data Mechanisms
In this article, we study the TBR and THR when one of X, Y, Z has missing data. Before introducing the specific missing mechanisms, we first review the definitions of missing at random (MAR) and missing not at random (MNAR) (Little and Rubin, 2002 [19]). Let $D$ denote the complete data, $D_{obs}$ the observed data, and $D_{mis}$ the missing data; therefore, we have $D = (D_{obs}, D_{mis})$. Next, we introduce the two missing mechanisms mentioned above.
Definition 1. The missing data mechanism is called missing at random (MAR) if the missing data indicator $M_X$, $M_Y$, or $M_Z$ depends only on the observed data $D_{obs}$, that is, if one of the following three formulas holds: $P(M_X \mid D) = P(M_X \mid D_{obs})$, $P(M_Y \mid D) = P(M_Y \mid D_{obs})$, or $P(M_Z \mid D) = P(M_Z \mid D_{obs})$. Otherwise, if $M_X$, $M_Y$, or $M_Z$ depends on the missing data $D_{mis}$, the missing data mechanism is called missing not at random (MNAR).
Under MAR, the missing mechanism does not depend on the missing data, so inference for the parameters can be based on the observed data alone. When the data are missing not at random, the mechanism is nonignorable, and we cannot ignore the missing data.
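The practical difference can be illustrated with a small exact calculation. The sketch below uses hypothetical probabilities and a hypothetical helper `complete_case_dist`; it compares the distribution of a binary X among units with X observed when the missingness probability is constant (the simplest ignorable case) and when it depends on the value of X itself (MNAR).

```python
from fractions import Fraction

def complete_case_dist(p_x, p_miss_given_x):
    """Distribution of X among the units whose X is observed."""
    observed = {x: p_x[x] * (1 - p_miss_given_x[x]) for x in p_x}
    total = sum(observed.values())
    return {x: observed[x] / total for x in observed}

# Hypothetical true distribution of a binary covariate.
p_x = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Missingness probability that does not depend on X.
constant = {0: Fraction(3, 10), 1: Fraction(3, 10)}
# Missingness probability that depends on the possibly missing X (MNAR).
mnar = {0: Fraction(1, 10), 1: Fraction(1, 2)}

cc_const = complete_case_dist(p_x, constant)
cc_mnar = complete_case_dist(p_x, mnar)
print(cc_const, cc_mnar)
```

With constant missingness the complete cases reproduce the true distribution of X, whereas under the MNAR setting they over-represent the level that is observed more often.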
In this article, we study the TBR and THR when one of X, Y, Z is under the condition of MNAR. The missing mechanisms of X, Y, or Z are important because they influence the identifiability and estimation of the TBR and THR. For each variable in X, Y, Z, we propose three missing mechanisms.
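First, we introduce three missing mechanisms of X, where $M_X$ denotes the missing data indicator of X ($M_X = 1$ if X is missing).
(1) $M_X$ depends on X, and $M_X$ is independent of $(Y, Z)$, given X, which means: $P(M_X = 1 \mid X, Y, Z) = P(M_X = 1 \mid X)$.
(2) $M_X$ depends on $(X, Z)$, and $M_X$ is independent of Y, given $(X, Z)$, which means: $P(M_X = 1 \mid X, Y, Z) = P(M_X = 1 \mid X, Z)$.
(3) $M_X$ depends on $(X, Y)$, and $M_X$ is independent of Z, given $(X, Y)$, which means: $P(M_X = 1 \mid X, Y, Z) = P(M_X = 1 \mid X, Y)$.
For the first missing mechanism of X, we assume that the missingness of X depends only on X. For the second missing mechanism of X, we assume that the missingness of X depends on X and Z. For the third missing mechanism of X, we assume that the missingness of X depends on X and Y. All three missing mechanisms are nonignorable, and none of them can be deduced from the others.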
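Similarly, we introduce three missing mechanisms of Y, where $M_Y$ denotes the missing data indicator of Y ($M_Y = 1$ if Y is missing).
(1) $M_Y$ depends on Y, and $M_Y$ is independent of $(X, Z)$, given Y, which means: $P(M_Y = 1 \mid X, Y, Z) = P(M_Y = 1 \mid Y)$.
(2) $M_Y$ depends on $(Y, Z)$, and $M_Y$ is independent of X, given $(Y, Z)$, which means: $P(M_Y = 1 \mid X, Y, Z) = P(M_Y = 1 \mid Y, Z)$.
(3) $M_Y$ depends on $(X, Y)$, and $M_Y$ is independent of Z, given $(X, Y)$, which means: $P(M_Y = 1 \mid X, Y, Z) = P(M_Y = 1 \mid X, Y)$.
Analogously, for the first missing mechanism of Y, we assume that the missingness of Y depends only on Y. For the second missing mechanism of Y, we assume that the missingness of Y depends on Y and Z. For the third missing mechanism of Y, we assume that the missingness of Y depends on X and Y. All three missing mechanisms are also nonignorable.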
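Lastly, we introduce three missing mechanisms of Z, where $M_Z$ denotes the missing data indicator of Z ($M_Z = 1$ if Z is missing).
(1) $M_Z$ depends on Z, and $M_Z$ is independent of $(X, Y)$, given Z, which means: $P(M_Z = 1 \mid X, Y, Z) = P(M_Z = 1 \mid Z)$.
(2) $M_Z$ depends on $(Y, Z)$, and $M_Z$ is independent of X, given $(Y, Z)$, which means: $P(M_Z = 1 \mid X, Y, Z) = P(M_Z = 1 \mid Y, Z)$.
(3) $M_Z$ depends on $(X, Z)$, and $M_Z$ is independent of Y, given $(X, Z)$, which means: $P(M_Z = 1 \mid X, Y, Z) = P(M_Z = 1 \mid X, Z)$.
For the first missing mechanism of Z, we assume that the missingness of Z depends only on Z. For the second missing mechanism of Z, we assume that the missingness of Z depends on Y and Z. For the third missing mechanism of Z, we assume that the missingness of Z depends on X and Z.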
The missing mechanisms mentioned above are all MNAR. Each of them assumes that the missing data indicator satisfies a conditional independence relationship. In the next section, we consider whether the TBR and THR can be identified under these missing mechanisms.
4. Identifiability of TBR and THR
In this section, we discuss the identifiability of the TBR and THR when one of X, Y, and Z has missing data. Under some mechanisms, we have to identify the joint distribution of (X, Y, Z) together with the corresponding missing data indicator to ensure the identifiability of the TBR and THR. The following theorems are stated under Assumptions 1 and 2. Before introducing the theorems, note that X has K levels, and Y and Z are both binary.
Firstly, we give sufficient conditions under which we can identify the TBR and THR when covariate X has missing data.
Theorem 1. When X has missing data:
- (1)
Under the first missing mechanism of X, the TBR and THR are identifiable when two matrices built from the observed-data distribution both have rank K.
- (2)
Under the second missing mechanism of X, the TBR and THR are identifiable when X and Y are not conditionally independent given Z among the units with X observed, and $K = 2$.
- (3)
Under the third missing mechanism of X, the TBR and THR are identifiable when X and Z are not conditionally independent given Y among the units with X observed, and $K = 2$.
When X has missing data, the identification conditions differ across missing mechanisms. Under the first missing mechanism, to identify the TBR and THR, we need to assume that both matrices have rank K. Both matrices have 4 rows, and one of them has K columns. Since the rank of both matrices is required to equal K, K must be less than or equal to 4. Under the second missing mechanism, if the covariate and the endpoint are not conditionally independent given Z among the units with X observed, and the covariate has only two levels, we can identify the TBR and THR. Under the third missing mechanism, if the covariate and the treatment are not conditionally independent given Y among the units with X observed, and the covariate has only two levels, we can also identify the TBR and THR.
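A rank condition of this kind is easy to check numerically once the matrix entries are available. The sketch below is only an illustration: the function `matrix_rank` and both 4 x K example matrices (with K = 2) are hypothetical and are not the matrices defined in the paper.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

K = 2
# Hypothetical matrix whose two columns are different distributions over 4 cells.
full = [
    [Fraction(1, 10), Fraction(2, 5)],
    [Fraction(2, 5), Fraction(1, 10)],
    [Fraction(1, 5), Fraction(3, 10)],
    [Fraction(3, 10), Fraction(1, 5)],
]
# Hypothetical matrix with two identical columns, so the condition fails.
deficient = [[row[0], row[0]] for row in full]

print(matrix_rank(full) == K, matrix_rank(deficient) == K)
```

Because a matrix with 4 rows has rank at most 4, such a check can only succeed for K up to 4, which matches the restriction noted above.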
Next, we give sufficient conditions under which we can identify the TBR and THR when endpoints Y have missing data.
Theorem 2. When Y has missing data:
- (1)
Under the first missing mechanism of Y, the TBR and THR are identifiable when two matrices, whose definitions can be found in the appendix, both have rank 2.
- (2)
Under the second missing mechanism of Y, the TBR and THR are identifiable when four matrices, whose definitions can be found in the appendix, all have rank 2.
- (3)
Under the third missing mechanism of Y, the TBR and THR are identifiable when Y and Z are not conditionally independent given X among the units with Y observed.
When Y has missing data, we cannot obtain a uniform identification condition: different missing mechanisms require different conditions to ensure the identification of the TBR and THR. Under the first missing mechanism, to identify the TBR and THR, we need to assume that the two matrices both have rank 2. Under the second missing mechanism, we need to assume that the four matrices all have rank 2. Under the last missing mechanism, we can identify the TBR and THR if Y and Z are not conditionally independent given X among the units with Y observed.
Lastly, we give sufficient conditions under which we can identify the TBR and THR when treatment Z has missing data.
Theorem 3. When Z has missing data:
- (1)
Under the first missing mechanism of Z, the TBR and THR are identifiable when two matrices, whose definitions can be found in the appendix, both have rank 2.
- (2)
Under the second missing mechanism of Z, the TBR and THR are identifiable when four matrices, whose definitions can be found in the appendix, all have rank 2.
- (3)
Under the third missing mechanism of Z, the TBR and THR are identifiable when Y and Z are not conditionally independent given X among the units with Z observed.
When Z has missing data, we also cannot obtain a uniform identification condition: different missing mechanisms require different conditions to ensure the identification of the TBR and THR. Under the first missing mechanism, to identify the TBR and THR, we need to assume that the two matrices both have rank 2. Under the second missing mechanism, we need to assume that the four matrices all have rank 2. Under the last missing mechanism, we can identify the TBR and THR if Y and Z are not conditionally independent given X among the units with Z observed.
The above three theorems give sufficient conditions under which the TBR and THR can be identified. In the next two sections, we illustrate our conclusions through simulations and real data.
5. Computational Details and Simulation Study
In this part, we first introduce how to use the EM algorithm to estimate the TBR and THR when covariate X has missing data and satisfies missing mechanism . When X satisfies other missing mechanisms or there are missing data in the other two variables, the estimation is similar. Next, we generate simulation data and then apply our method to the simulation data to illustrate that our estimation works. We use statistical software R to implement our numerical simulation.
5.1. Expectation Maximization Algorithms
We define and , , where “+” represents the marginal distribution over corresponding variable. Similarly, let denote the observed frequency in the cell of the contingency table, and denote the marginal frequency of the contingency table over the corresponding variable X. When “+” is at another position, its meaning is the same.
In practice, we can use the expectation maximization (EM) algorithm to find the MLEs. In this subsection, we only describe the computational details for missing mechanism
. For simplicity, we only describe the algorithms for binary
X. The algorithms for multi-categorical
X can be written similarly. Under the missing mechanism
, we have
. Thus, the joint distribution of
can be written as
. Superscript
j indicates the
j-th iteration. Define:
The EM algorithm iterates between the following E-step and M-step:
- (a)
E-step: The sufficient statistics are imputed as and ;
- (b)
M-step: The joint distribution is updated by .
After the algorithm converges, we assume that the convergent probability is
. According to the formula in the second section, we can estimate the TBR and THR as follows.
Lastly, we calculate the standard errors of the above estimator by repeating the processes 1000 times.
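The E-step and M-step can be sketched in code for one concrete case. The sketch below assumes a binary X whose missingness depends only on X (the first missing mechanism of X); this choice, the function names, the starting values, and the counts are all assumptions made for illustration and are not taken from the paper's R implementation.

```python
import math

CELLS = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

def loglik(n_obs, n_mis, p, rho):
    """Observed-data log-likelihood; cells with zero count are skipped."""
    total = 0.0
    for (x, y, z), c in n_obs.items():
        if c > 0:
            total += c * math.log(p[(x, y, z)] * (1 - rho[x]))
    for (y, z), c in n_mis.items():
        if c > 0:
            total += c * math.log(sum(p[(k, y, z)] * rho[k] for k in (0, 1)))
    return total

def em_missing_x(n_obs, n_mis, n_iter=200):
    """EM for binary X with P(X missing | X, Y, Z) = rho[X] (assumed mechanism).

    n_obs[(x, y, z)]: counts with X observed; n_mis[(y, z)]: counts with X missing.
    """
    n = sum(n_obs.values()) + sum(n_mis.values())
    p = {c: 1.0 / len(CELLS) for c in CELLS}   # P(X = x, Y = y, Z = z)
    rho = {0: 0.5, 1: 0.5}                     # P(X missing | X = x)
    trace = []
    for _ in range(n_iter):
        trace.append(loglik(n_obs, n_mis, p, rho))
        # E-step: split each missing-X count over x in proportion to p * rho.
        filled = {}
        for (x, y, z) in CELLS:
            denom = sum(p[(k, y, z)] * rho[k] for k in (0, 1))
            share = p[(x, y, z)] * rho[x] / denom if denom > 0 else 0.0
            filled[(x, y, z)] = n_mis.get((y, z), 0) * share
        # M-step: closed-form updates from the completed table.
        p = {c: (n_obs.get(c, 0) + filled[c]) / n for c in CELLS}
        for x in (0, 1):
            mis_x = sum(filled[c] for c in CELLS if c[0] == x)
            obs_x = sum(n_obs.get(c, 0) for c in CELLS if c[0] == x)
            rho[x] = mis_x / (mis_x + obs_x)
    return p, rho, trace

def tbr_thr_from_joint(p):
    """TBR and THR from a joint distribution of (X, Y, Z), as in Section 2."""
    tbr = thr = 0.0
    for x in (0, 1):
        p_x = sum(p[c] for c in CELLS if c[0] == x)
        q = {z: p[(x, 1, z)] / (p[(x, 0, z)] + p[(x, 1, z)]) for z in (0, 1)}
        tbr += p_x * (1 - q[0]) * q[1]
        thr += p_x * q[0] * (1 - q[1])
    return tbr, thr

# Hypothetical counts.
n_obs = {(0, 0, 0): 40, (0, 1, 0): 30, (0, 0, 1): 20, (0, 1, 1): 50,
         (1, 0, 0): 15, (1, 1, 0): 20, (1, 0, 1): 10, (1, 1, 1): 25}
n_mis = {(0, 0): 20, (1, 0): 25, (0, 1): 15, (1, 1): 30}
p_hat, rho_hat, trace = em_missing_x(n_obs, n_mis)
tbr_hat, thr_hat = tbr_thr_from_joint(p_hat)
print(tbr_hat, thr_hat)
```

A useful sanity check on any such implementation is that the recorded log-likelihood never decreases across iterations, and that with no missing data the fitted joint distribution reduces to the observed cell proportions.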
5.2. Simulation Study
In this section, we evaluate the finite sample performances of the likelihood-based estimator for the missing mechanisms and via simulation studies. In order to mimic the real data analyzed in the next section, we assume that Z is completely randomized and . We generated and . is defined, and Y is generated according to the conditional distribution . We set the parameters of the two missing mechanisms as follows.
We use the EM algorithm to find the MLEs of the parameters and calculate the corresponding TBR and THR. The sample sizes of the simulation study are 500, 1000, and 1500, and we repeat the simulation 1000 times for each sample size. The means and the standard errors of the estimates of the TBR and THR are given in Table 1 and Table 2.
We can see from the simulation results that the TBR and THR can be estimated consistently, which supports the identifiability of the TBR and THR. As the sample size increases, the standard deviation decreases gradually.
6. Application
In this section, we illustrate our method with three real data examples.
6.1. Application to the Second Multicenter Automatic Defibrillator Intervention Trial
In this section, we re-analyzed a randomized clinical trial using the newly proposed methods under the missing mechanism
. We first briefly review the background of the illustrative clinical trial, and more details of the data can be found in the previous paper ([
8]). In this example,
Z is the treatment assignment variable, with $Z = 1$ denoting the treatment (implantable cardiac defibrillator) and $Z = 0$ denoting the control. The endpoint
Y is the death indicator, with
denoting dead and
denoting alive. Let
X denote the inducible indicator, with
denoting inducible and
denoting noninducible. The covariate
X is obtained from the electrophysiological stimulation (EPS) testing. Because the EPS testing is invasive and not a prerequisite for enrollment in the study, 79.3% of patients in the implantable cardiac defibrillator arm have EPS records, whereas only 2.4% of patients in the control arm have EPS records. Therefore, the problem of missing data for the covariate
X is very severe. The observed data can be summarized as the following counts (
):
,
,
,
,
,
,
,
,
,
,
, and
. We assume that the missing mechanism of
X is
. Firstly, we use the EM algorithm to calculate the maximum likelihood estimation of the parameters and then calculate the
and
. Then, the sampling is repeated 1500 times to calculate the standard deviation of TBR and THR. The estimated TBR and THR are
and
. The numbers in brackets indicate the standard deviation.
6.2. Application to the Mechanical Treatment Trial for Crisis Patients
In this section, we will re-analyze a randomized clinical trial using the newly proposed methods under missing mechanism
. We first briefly review the background of the trial ([
9]). In this example, critically ill patients receiving mechanical ventilation were randomized 1:1 within each study site to management with a paired sedation plus ventilator weaning protocol, involving daily interruption of sedatives through spontaneous awakening trials (SATs) and spontaneous breathing trials (SBTs), or to sedation per usual care (UC) and SBTs.
Z is the treatment assignment variable, with $Z = 1$ denoting the treatment (SAT and SBT) and $Z = 0$ denoting the control (UC and SBT). The endpoint
Y is the cognitive score, with
denoting “higher cognitive ability” and
denoting “lower cognitive ability”. Let
X denote age, with
denoting “the people older than 33 years old” and
denoting “the people younger than 33 years old”. In randomized studies involving severely ill patients, functional endpoints are often unobserved due to missed clinic visits, premature withdrawal, or death. The observed data can be summarized as the following counts (
):
,
,
,
,
, and
. We assume that the missing mechanism of
Y is
. Similarly, we use the EM algorithm to calculate the maximum likelihood estimation of the parameters and then calculate the
and
. Then, we use the bootstrap method to repeat sampling 1500 times to calculate the standard deviation of the TBR and THR. The estimated TBR and THR are
and
. The numbers in brackets indicate the standard deviation.
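The bootstrap step can be sketched generically: units are resampled with replacement from the table of counts, the estimator is recomputed on each resample, and the standard deviation of the replicates is reported. In the sketch below, `bootstrap_sd`, the counts, and the simple placeholder estimator `share_y1` are hypothetical; in the analysis above, the estimator would be the EM-based estimate of the TBR or THR.

```python
import random
import statistics
from collections import Counter

def bootstrap_sd(counts, estimator, n_boot=300, seed=0):
    """Bootstrap standard deviation of estimator(counts).

    counts maps a cell label to the number of units in that cell;
    estimator is any function from such a table to a number.
    """
    rng = random.Random(seed)
    cells = list(counts)
    weights = [counts[c] for c in cells]
    n = sum(weights)
    estimates = []
    for _ in range(n_boot):
        # Draw n units with replacement and rebuild the table of counts.
        resample = Counter(rng.choices(cells, weights=weights, k=n))
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

def share_y1(counts):
    """Placeholder estimator: proportion of units with Y = 1."""
    n = sum(counts.values())
    return sum(c for (x, y, z), c in counts.items() if y == 1) / n

# Hypothetical counts, keyed by (x, y, z).
counts = {
    (0, 0, 0): 30, (0, 1, 0): 20, (0, 0, 1): 10, (0, 1, 1): 40,
    (1, 0, 0): 25, (1, 1, 0): 25, (1, 0, 1): 20, (1, 1, 1): 30,
}
sd = bootstrap_sd(counts, share_y1)
print(sd)
```

Fixing the seed makes the reported standard deviation reproducible; the number of replicates (1500 in the analyses above, 300 in this sketch) trades computing time against Monte Carlo error.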
6.3. Application to the Job Search Intervention Study
In this section, we will analyze a randomized trial using the proposed methods under missing mechanism
. Firstly, we will introduce the background of the data. The Job Search Intervention Study (JOBS II) was a randomized field experiment that investigated the efficacy of a job training intervention on unemployed workers ([
20]). There are 899 unemployed workers in the “jobs” dataset. All the workers were randomly assigned to one of two groups: the control group (people who received a booklet describing the job-search process) and the treatment group (people who participated in job skills workshops). The binary endpoint represents whether the respondent had become employed.
Z is the treatment assignment variable, with $Z = 1$ denoting the treatment (people who participated in job skills workshops) and $Z = 0$ denoting the control (people who received a booklet describing the job-search process).
Y denotes the endpoint;
denotes that the worker became employed finally, that is, the treatment worked; and
denotes that the worker was still unemployed. Additionally,
X denotes sex, with
for female and
for male. The observed data can be summarized as the following counts (
):
,
,
,
,
,
,
,
. Based on this data, we assume
and manually generate missing data. The generated data can be summarized as the following counts (
):
,
,
,
,
. Similarly, we use the EM algorithm to calculate the maximum likelihood estimation of the parameters and then calculate the
and
. We use the bootstrap method to repeat the sampling 1500 times to calculate the standard deviation of the TBR and THR. The estimated TBR and THR are
and
. The number in brackets indicates the standard deviation.
7. Discussion
In the field of causal inference, the average causal effect is an important measure, but it has a flaw: it ignores the heterogeneous responses to the treatment in the target population. Therefore, in this article, we study the TBR and THR proposed by [4]. In addition, missing data are common in randomized experiments [21], so we assume that there are missing data in one of the covariate, endpoint, or treatment. We give sufficient conditions to make the TBR and THR identifiable in the presence of missing data. We illustrate our method through simulated data, and then apply it to several real datasets.
There are several issues beyond the scope of this paper. First, in Assumption 2, we assume that, given the covariate X, the two potential endpoints $Y_0$ and $Y_1$ are conditionally independent. This assumption also appeared in [4]. However, we can only observe one of the two potential endpoints for each unit, and the other one cannot be observed, which means that Assumption 2 cannot be verified from the data. Thus, it would be better to propose a more appropriate assumption to ensure that the TBR and THR can be identified.
Second, in our article, we assume that the covariate X in Assumption 1 is a binary one-dimensional variable. However, in practice, X may be a continuous variable or high-dimensional variable, and there may also be unobservable variables in X. In this case, even if there are no missing data, it is very difficult to identify the TBR and THR because the observations in each subgroup may be very sparse in a limited sample. If there are still missing data, we need to propose new conditions so that the TBR and THR can be identified.
Third, we discussed the situation where only one of the covariate, endpoint, and treatment variables may be MNAR. In many applications, both the covariate and the endpoint may be MNAR at the same time. In this case, the identification and estimation of the TBR and THR will be more complicated.
Although the problems mentioned above are beyond the scope of this article, we will continue our research in this area.