1. Introduction
In observational studies, measured confounders can be controlled by a variety of methods, such as propensity score matching and regression adjustment. However, when a confounding variable is unmeasured, traditional causal inference methods usually yield biased estimators: changes in the unmeasured confounder induce changes in the explanatory variable, and both affect the response variable. Failing to adjust for such a confounder leads to a spurious association between the explanatory variable and the outcome. Analysis of instrumental variables (IV) has gained popularity in causal inference, for example in investigating causal graphical structures [1,2] and controlling for unmeasured confounding [3,4]. An instrument is a variable that is correlated with the explanatory variable but not associated with any unmeasured confounders. In addition, the instrumental variable is supposed to influence the response variable only through the explanatory variable, i.e., there is no direct effect of this variable on the response. Instrumental variable analysis can be applied in many areas and disciplines, such as economics and epidemiology. For example, the causal relationship between years of schooling and earnings has been studied in the economics literature [5]. That study exploits college proximity as the instrumental variable, because those living near a college or university usually attain a significantly higher level of education than others. On the other hand, it is believed that college proximity may improve earnings only by increasing the subject's years of schooling. Together, these two observations indicate that college proximity is a useful instrumental variable. In biomedical and epidemiological research, the main interest is to investigate the causal effect of an exposure variable on a certain disease outcome. A gene can serve as a good instrument if it is closely linked to the exposure but has no direct effect on the disease [6]. The study of genetic variants as instrumental variables is known as Mendelian randomization, which is discussed extensively in the literature (e.g., [7,8]). For instance, a set of 32 recently identified genetic variants was used as instrumental variables to study whether child fat mass causally affects academic achievement and blood pressure [9].
1.1. Related Work
Since the development of instrumental variables, many estimation methods have been proposed for causal effect estimation. Two-stage least squares (2SLS) [10] is one of the most commonly used methods for instrumental variable estimation, and theoretical analyses such as consistency and asymptotic normality exist in the literature. When the response variable is binary, the second stage can be modified with logistic regression in Mendelian randomization studies [11]. Another approach is likelihood-based, particularly the limited information maximum likelihood (LIML) method [12], which has been shown to be more effective in dealing with weak instruments [13]. The phenomenon of weak instruments occurs when the correlation between the instrument(s) and the explanatory variable is close to zero. With weak instruments, 2SLS is generally unstable and the causal effect estimators are badly biased. The typical rule of thumb for detecting weak instruments is based on the F-statistic: an instrument may be weak if the first-stage F-statistic is less than 10 [14].
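The rule of thumb is straightforward to check in practice. Below is a minimal sketch, assuming a single instrument and simulated data (the instrument strength 0.4 and sample size 500 are illustrative choices, not values from any study cited here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
z = rng.normal(size=500)                     # instrument
x = 0.4 * z + rng.normal(size=500)           # explanatory variable

# First-stage regression of x on z; the overall F-statistic tests
# whether the instrument explains the explanatory variable.
first_stage = sm.OLS(x, sm.add_constant(z)).fit()
print(first_stage.fvalue)                    # F < 10 suggests a weak instrument [14]
```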
Most IV approaches impose linear assumptions among the instrument, the explanatory variable, and the response variable. However, this is not always appropriate. For example, a subject's years of schooling may only have a positive effect on subsequent earnings if the subject obtained at least a high-school degree, while there would be no difference in earnings between subjects who obtained only an elementary or a middle school education. In this hypothetical scenario, a linear regression model between the explanatory and response variables is clearly misspecified. When the null hypothesis of a linear relationship is rejected, one strategy is to develop piecewise linear models, which are more interpretable than completely nonlinear models.
In this paper, we propose a piecewise linear instrumental variable (PLIV) model for estimating the causal effect via a continuous threshold function. The continuous threshold function assumes that both the explanatory variable and the instrumental variable are continuous. Instrumental variable models with continuous variables have been studied extensively in the literature. For example, continuous instruments have been used in classical IV models developed in a structural equation modeling framework [15]. A recent paper proposes semiparametric doubly robust estimators of causal effects with continuous instruments [16]. Moreover, discussions of a continuous exposure and a continuous response in Mendelian randomization can be found in a review paper [8].
A threshold in a variable occurs when there is a sudden change in the values of that variable; we call the point where the change happens a cut-off point or a threshold. A subset causal effect exists when there is a threshold in the explanatory variable. The proposed PLIV model is useful because it can capture the subset causal effect when the true model is not linear, and it degenerates to a linear instrumental variable model when the relationship among the variables is indeed linear. In other words, by using piecewise linear functions, we can quantitatively find the subset effects of the explanatory and instrumental variables.
We use the Rectified Linear Unit (ReLU) function, mathematically defined in Equation (1), to incorporate the piecewise relationships. Utilization of the ReLU function for defining subset effects has been studied in the literature, such as a regression kink model that tests the presence of the threshold [17] and the segmented and hinge models for subset effects in logistic regression [18]. Besides, continuous threshold models via the ReLU function with two-way interactions have been considered in the Cox proportional hazards model, where asymptotic normality is established under mild conditions [19]. In this paper, we use a continuous threshold function with multiple thresholds to formulate piecewise linear instrumental variable models. A related piecewise linear instrumental variable model based on a random slope approach has been studied in the literature [20]; it divides the data into a few segments and analyzes each segment individually, but this method suffers from a substantial loss of efficiency and accuracy.
1.2. Contribution of This Article
In this paper, we consider a piecewise linear model for settings where the linearity assumption is inappropriate for the data, and we provide a rigorous treatment of the statistical properties of the model. Our contributions can be summarized as follows.
We simultaneously estimate the coefficients and thresholds of the piecewise linear instrumental variable model by the limited information maximum likelihood (LIML) method, assuming the number of thresholds is known.
The proposed piecewise linear instrumental variable model degenerates to the linear instrumental variable model if there are no thresholds; therefore, it provides a generalization of the linear instrumental variable model. To the best of our knowledge, this is the first work on a piecewise linear extension of the traditional linear instrumental variable models.
We also study the theoretical properties of the PLIV model, including the consistency and asymptotic normality of the estimators.
2. Piecewise Linear Instrumental Variable Model
Notations: we denote scalars by unbolded lowercase letters (e.g., sample size $n$ and the $i$-th observation $y_i$ of the outcome variable), random variables by unbolded capital letters (e.g., $X$), random vectors by boldface lowercase letters (e.g., $\mathbf{x}$ and $\mathbf{z}$), and matrices by boldface capital letters (e.g., $\mathbf{X}$).
In the ordinary linear regression model $y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i$, there is an assumption that the explanatory variables are uncorrelated with the error term, i.e., $\operatorname{cov}(\mathbf{x}_i, u_i) = \mathbf{0}$. However, there are situations where the covariance between the explanatory variables and the error term is nonzero. This leads to inconsistent ordinary least squares estimation due to the phenomenon of endogeneity in $\mathbf{x}_i$. One way to deal with this issue is to introduce an instrumental variable, whose changes are related to changes in the explanatory variable but do not lead to changes in the response variable directly.
Let $(x_i, y_i, z_i)$ denote the observed data for $i = 1, \ldots, n$, where $X$ is the explanatory variable, $Y$ is the response variable, and $Z$ is the instrumental variable. To estimate the subset causal effect and establish the piecewise linear relationship, for any threshold parameter $c$, we use a continuous threshold function defined as
$$(x - c)_{+} = (x - c)\,\mathbb{1}\{x > c\}, \qquad (1)$$
where $\mathbb{1}\{\cdot\}$ is an indicator function. The ReLU function, commonly used as an activation function in deep learning, is a special case with $c = 0$, such that $\mathrm{ReLU}(x) = x\,\mathbb{1}\{x > 0\} = \max(x, 0)$.
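The continuous threshold function is simple to implement. A minimal sketch in Python with NumPy (the implementation language is our choice of convenience, not the paper's):

```python
import numpy as np

def threshold(x, c):
    """Continuous threshold function (x - c)_+ = (x - c) * 1{x > c}."""
    return np.maximum(x - c, 0.0)

x = np.linspace(-2.0, 2.0, 5)
print(threshold(x, 0.0))   # ReLU: the special case c = 0   -> [0. 0. 0. 1. 2.]
print(threshold(x, 0.5))   # hinge shifted to c = 0.5       -> [0. 0. 0. 0.5 1.5]
```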
The proposed model provides sparsity and computational efficiency compared to the smoothing or approximation approaches in the literature: the estimation stage involves indicator functions, but it does not require an approximation of the indicator function. Let $K$ and $J$ denote the number of thresholds in $Z$ and $X$, respectively. Denote $\mathbf{c}_z = (c_{z,1}, \ldots, c_{z,K})^{\top}$ as the vector of thresholds in $Z$ and $\mathbf{c}_x = (c_{x,1}, \ldots, c_{x,J})^{\top}$ as the vector of thresholds in $X$. We propose the following piecewise linear instrumental variable model:
$$y_i = \beta_0 + \beta_1 x_i + \sum_{j=1}^{J} \beta_{1+j}\,(x_i - c_{x,j})_{+} + u_i, \qquad (2)$$
$$x_i = \alpha_0 + \alpha_1 z_i + \sum_{k=1}^{K} \alpha_{1+k}\,(z_i - c_{z,k})_{+} + v_i, \qquad (3)$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{1+J})^{\top}$ is the vector of coefficients representing the causal effect of $X$ on $Y$; $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \ldots, \alpha_{1+K})^{\top}$ is the vector of coefficients representing the instrumental effect of $Z$ on $X$; and $u_i$ and $v_i$ are the error terms for the $i$-th observation. In the context of causal inference, we interpret the slope coefficients in $\boldsymbol{\beta}$ as the causal effect of $x$ on $y$. More specifically, for $x \in (c_{x,J}, x_{\max})$, with $x_{\max}$ denoting the maximum value of $x$, one unit increase in $x$ leads to $\beta_1 + \sum_{j=1}^{J}\beta_{1+j}$ units change in $y$. Besides, $\beta_1$ represents the change in $y$ that is caused by one unit increase in $x$ for $x \in (x_{\min}, c_{x,1})$, where $x_{\min}$ is the minimum value of $x$. To better understand this, in Figure 1 we plot a function with a single threshold $c$ as an example: when $x < c$, the slope is $\beta_1$; when $x > c$, the slope is $\beta_1 + \beta_2$.
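To make the structure of (2) and (3) concrete, the following sketch simulates data from a model of this assumed form with one threshold in each equation. All parameter values here are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def pw_linear(v, intercept, slope, hinge_slopes, thresholds):
    """intercept + slope*v + sum_j hinge_slopes[j] * (v - thresholds[j])_+"""
    out = intercept + slope * v
    for b, c in zip(hinge_slopes, thresholds):
        out = out + b * np.maximum(v - c, 0.0)
    return out

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)
v = rng.normal(scale=0.5, size=n)
u = 0.5 * v + rng.normal(scale=0.5, size=n)     # correlated errors mimic confounding
x = pw_linear(z, 0.0, 1.0, [1.5], [0.0]) + v    # Equation (3): one threshold in z
y = pw_linear(x, 0.0, 1.0, [-0.8], [0.5]) + u   # Equation (2): one threshold in x
```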
Here, we assume $K$ and $J$ are prespecified according to prior knowledge or theoretical justifications. In practice, we may use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [21] to select them. A more elegant examination of the condition for the number of thresholds can be found in Newey [22]. In particular, when $K = 0$ and $J = 0$, our proposed model degenerates to the traditional linear instrumental variable model.
For instrumental variable analysis, an instrumental variable is correlated with the explanatory variable but uncorrelated with the error term. In our model, $\mathbf{z}$ is the vector of instrumental variables with the following properties:
Instrument relevance: $\operatorname{cov}(\mathbf{z}, X) \neq \mathbf{0}$, i.e., $\mathbf{z}$ is correlated with the explanatory variable $X$.
Instrument exogeneity: $\operatorname{cov}(\mathbf{z}, U) = \mathbf{0}$, i.e., $\mathbf{z}$ is uncorrelated with the error term $U$.
For identifiability, we assume that the number of instruments is larger than or equal to the number of endogenous variables.
Remark 1. Note that intensive research on nonlinear instrumental variable models has been conducted in the literature, such as nonparametric instrumental regression [23,24,25]. We point out that the target of our method is to quantitatively find the thresholds and estimate the subset causal effects. We aim to generalize the traditional linear IV model and fit an interpretable model, rather than approximate the data by a nonlinear function.
To estimate the unknown parameters in (2) and (3), we utilize the two-stage least squares (2SLS) method and the limited information maximum likelihood (LIML) method. Details about the proposed estimation methods are discussed below.
3. Simultaneous Maximum Likelihood Estimation
We first introduce how the LIML method is used in our model, and then describe how the naive estimators are initialized by the 2SLS method.
3.1. Limited Information Maximum Likelihood
As discussed in the introduction regarding its advantages, limited information maximum likelihood is another popular approach for estimation in instrumental variable models. Here, we assume the error terms $(U, V)$ are jointly normally distributed and correlated to some extent due to the unmeasured confounding effect. Let $(0, 0)^{\top}$ be the zero-mean vector and $\rho$ be the correlation of $(U, V)$. Denote $\sigma_u^2$ and $\sigma_v^2$ as the variances of the error terms $U$ and $V$, respectively. Then the probability density function of the bivariate normal $(U, V)$ can be written as:
$$f(u, v) = \frac{1}{2\pi\sigma_u\sigma_v\sqrt{1 - \rho^2}}\exp\!\left(-\frac{Q}{2(1 - \rho^2)}\right),$$
where the quadratic form $Q = (u/\sigma_u)^2 - 2\rho\,(u/\sigma_u)(v/\sigma_v) + (v/\sigma_v)^2$. For a single observation, the log-likelihood is
$$\ell(\theta; y_i, x_i, z_i) = -\log\!\left(2\pi\sigma_u\sigma_v\sqrt{1 - \rho^2}\right) - \frac{Q_i}{2(1 - \rho^2)},$$
where $\theta$ denotes all the model parameters and $Q_i$ is the quadratic form evaluated at the residuals $u_i$ and $v_i$ from (2) and (3). To simplify notation, we let $\ell_n(\theta) = \sum_{i=1}^{n}\ell(\theta; y_i, x_i, z_i)$ denote the log-likelihood. The maximum likelihood estimate of $\theta$ is obtained by maximizing the log-likelihood within a compact set $\Theta$, i.e., $\hat{\theta} = \arg\max_{\theta \in \Theta}\ell_n(\theta)$. However, there is no closed-form solution for $\hat{\theta}$, so we take a gradient-based algorithm for estimation; this yields approximate M-estimators. To speed up the estimation, we use the two-stage least squares method to initialize the estimators.
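A minimal sketch of this likelihood and its gradient-based maximization for the special case $K = J = 1$, reusing the simulated x, y, z from the sketch above. The choice of SciPy's BFGS optimizer and the log/untransformed parameterization are our assumptions; the paper's algorithm is detailed in its Appendix A:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, x, y, z):
    """Negative bivariate-normal log-likelihood; theta packs
    (b0, b1, b2, cx, a0, a1, a2, cz, log_su, log_sv, rho)."""
    b0, b1, b2, cx, a0, a1, a2, cz, lsu, lsv, rho = theta
    su, sv = np.exp(lsu), np.exp(lsv)             # log-parameterized std devs
    u = y - (b0 + b1 * x + b2 * np.maximum(x - cx, 0.0))  # residuals of (2)
    v = x - (a0 + a1 * z + a2 * np.maximum(z - cz, 0.0))  # residuals of (3)
    q = (u / su) ** 2 - 2 * rho * (u / su) * (v / sv) + (v / sv) ** 2
    # (a bounded parameterization of rho would be used in practice)
    return np.sum(np.log(2 * np.pi * su * sv * np.sqrt(1 - rho ** 2))
                  + q / (2 * (1 - rho ** 2)))

# theta0 would come from the 2SLS initialization of Section 3.2.
theta0 = np.array([0., 1., -0.5, 0.4, 0., 1., 1., 0., 0., 0., 0.2])
res = minimize(neg_loglik, theta0, args=(x, y, z), method="BFGS")
```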
3.2. Initialization: Two-Stage Least Square
The traditional two-stage least squares method regresses the explanatory variable on the instrumental variable and computes the predictions $\hat{x}$ in the first stage. In the second stage, it regresses the response variable on the predictions $\hat{x}$, and the causal effect of interest is estimated from this second stage. In our method, we employ 2SLS to obtain the initial values of the parameters of the piecewise linear instrumental variable model. Below we describe the 2SLS procedure for initialization:
Stage 1: First, we regress $x$ on $\tilde{\mathbf{z}} = \left(1, z, (z - c_{z,1})_{+}, \ldots, (z - c_{z,K})_{+}\right)^{\top}$ and then obtain the fitted values $\hat{x}_i = \tilde{\mathbf{z}}_i^{\top}\hat{\boldsymbol{\alpha}}$.
Stage 2: We regress $y$ on $\tilde{\mathbf{x}} = \left(1, \hat{x}, (\hat{x} - c_{x,1})_{+}, \ldots, (\hat{x} - c_{x,J})_{+}\right)^{\top}$. Thus, in the second stage, we fit the following regression model:
$$y_i = \beta_0 + \beta_1 \hat{x}_i + \sum_{j=1}^{J}\beta_{1+j}\,(\hat{x}_i - c_{x,j})_{+} + \varepsilon_i.$$
For each combination of the numbers of thresholds in $X$ and $Z$, we could pick $\mathbf{c}_z$, $\mathbf{c}_x$, and the regression coefficients simultaneously through a grid search that minimizes the sum of squared errors (SSE) of $Y$. However, for larger $K$ or $J$, the grid search becomes computationally expensive. Since we only need 2SLS to provide the initialization of the parameters in our method, we instead choose $\mathbf{c}_z$ to be a vector of points evenly spaced between the 5% and 95% quantiles of $Z$, and similarly choose $\mathbf{c}_x$ to be a vector of points evenly spaced between the 5% and 95% quantiles of $X$. Points below the 5% quantile and above the 95% quantile are ignored in order to avoid boundary effects. The regression coefficients are then obtained accordingly.
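A minimal sketch of this initialization, assuming arrays x, y, z as before. Placing the $J$ thresholds on the fitted values $\hat{x}$ in Stage 2 and using strictly interior evenly spaced points are our reading of the procedure:

```python
import numpy as np

def design(v, thresholds):
    """Design matrix [1, v, (v - c1)_+, ..., (v - cm)_+]."""
    return np.column_stack([np.ones_like(v), v]
                           + [np.maximum(v - c, 0.0) for c in thresholds])

def two_stage_init(x, y, z, K, J):
    """2SLS initialization: thresholds evenly spaced strictly between
    the 5% and 95% quantiles (interior points only)."""
    cz = np.linspace(*np.quantile(z, [0.05, 0.95]), K + 2)[1:-1]
    cx = np.linspace(*np.quantile(x, [0.05, 0.95]), J + 2)[1:-1]
    alpha, *_ = np.linalg.lstsq(design(z, cz), x, rcond=None)    # Stage 1
    xhat = design(z, cz) @ alpha
    beta, *_ = np.linalg.lstsq(design(xhat, cx), y, rcond=None)  # Stage 2
    return alpha, cz, beta, cx
```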
3.3. Theoretical Analysis
Under mild conditions, we study the statistical properties of the proposed model and establish robust variance-covariance estimators for the estimated parameters under the correctly specified and misspecified models separately. To investigate the theoretical properties, we consider the following regularity conditions:
C1. Observations $(x_i, y_i, z_i)$, $i = 1, \ldots, n$, are independently and identically distributed on a compact set. Furthermore, $E(U) = 0$, $E(V) = 0$, and $E(U^2) + E(V^2) < \infty$.
C2. The explanatory variable $X$ and the instrumental variable $Z$ are continuous in the parameter space, i.e., they have continuous probability density functions $f_X$ and $f_Z$. The density functions are uniformly bounded; that is, there exist constants $m_1$, $M_1$, $m_2$, and $M_2$ such that
$$0 < m_1 \le f_X(x) \le M_1 < \infty \quad \text{and} \quad 0 < m_2 \le f_Z(z) \le M_2 < \infty.$$
Furthermore, the true values of the coefficients for the threshold effects satisfy $\beta_{1+j} \neq 0$ and $\alpha_{1+k} \neq 0$, where $1 \le j \le J$ and $1 \le k \le K$.
C3. $\ell(\theta; y, x, z)$ is upper-semicontinuous in $\theta$ for almost all $(y, x, z)$; that is, for every $\theta \in \Theta$ and every sequence $\theta_m \to \theta$,
$$\limsup_{m \to \infty} \ell(\theta_m; y, x, z) \le \ell(\theta; y, x, z) \quad \text{for almost every } (y, x, z).$$
Remark 2. Condition C1 is commonly used in regression models. Condition C2 is used for estimating the unknown thresholds and ensures the model is identifiable. The continuity requirements of X and Z are used to estimate the thresholds. Condition C3 is used to establish the consistency and the asymptotic normality of the maximum likelihood estimator.
In terms of estimation, we take a gradient-based method that depends on the first order derivative of the log-likelihood (details can be found in Appendix A), initialized with the 2SLS estimators. In this paper, we do not approximate the indicator function by the logistic function as some researchers do (e.g., [18,26,27]). The gradient-based algorithm for the ReLU function has shown success in the context of deep learning and machine learning. Compared to the approximation techniques discussed in Section 1, model estimation with the ReLU function is computationally cheaper since no approximation of the indicator function is required. In fact, as long as Condition C2 is satisfied, which requires the variables $X$ and $Z$ to be continuous, the gradients composed of the indicator functions converge to a continuous function of the threshold parameters as $n \to \infty$; for example, $n^{-1}\sum_{i=1}^{n}\mathbb{1}\{x_i > c\} \to \Pr(X > c)$ for any threshold $c$ by the law of large numbers. Therefore, the second order derivative of the ReLU function with respect to the thresholds can be derived based on the resulting continuous probability function. More specifically, the second order derivative with respect to the threshold $c$ follows from differentiating the limiting probability, e.g., $\partial \Pr(X > c)/\partial c = -f_X(c)$.
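The law-of-large-numbers smoothing described above is easy to see numerically. A small sketch (the standard normal $X$ and threshold $c = 0.5$ are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
c = 0.5
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    # Empirical average of the indicator vs. its smooth limit P(X > c).
    print(n, np.mean(x > c), norm.sf(c))
```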
To prove the asymptotic normality, we first need to show the consistency of the proposed estimators.
Theorem 1. Under conditions C1–C4, assume that $\Theta$ is compact and the true parameter vector $\theta_0$ is unique. Furthermore, for every sufficiently small ball $B \subset \Theta$, $\sup_{\theta \in B}\ell(\theta; \cdot)$ is measurable with $E\left[\sup_{\theta \in B}\ell(\theta; \cdot)\right] < \infty$; then $\hat{\theta} \xrightarrow{p} \theta_0$.
Proof. The proof follows Theorem 5.7 of van der Vaart [28]; for completeness, we include it as Theorem A1 in Appendix B. To utilize Theorem 5.7, we need to check the condition that $\ell_n(\hat{\theta}) \ge \ell_n(\theta_0) - o_P(1)$. This is true since $\ell(\theta; \cdot)$ is continuous in $\theta$, $n^{-1}\ell_n(\theta)$ converges to its expectation uniformly, and $\hat{\theta}$ (approximately) maximizes $\ell_n(\theta)$. Thus, all the conditions are satisfied and the result follows. □
Theorem 2. Under conditions C1–C4, let $\theta_0$ be the true value of $\theta$. Let $\dot{\ell}_{\theta}$ be a measurable function with $E\big[\dot{\ell}_{\theta_0,j}^2\big] < \infty$ for $j = 1, \ldots, |\theta|$, where $|\theta|$ denotes the number of elements in $\theta$; then
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} N\!\left(\mathbf{0},\, A_{\theta_0}^{-1} B_{\theta_0} A_{\theta_0}^{-1}\right),$$
where $B_{\theta_0} = E\big[\dot{\ell}_{\theta_0}\dot{\ell}_{\theta_0}^{\top}\big]$ with $\dot{\ell}_{\theta_0}$ the first order derivative of $\ell$ with respect to $\theta$ evaluated at $\theta_0$, and $A_{\theta_0} = -E\big[\ddot{\ell}_{\theta_0}\big]$ with $\ddot{\ell}_{\theta_0}$ the second order derivative of $\ell$ with respect to $\theta$ evaluated at $\theta_0$ (derivations in Appendix A). $A_{\theta_0}$ has a block form in which $\mathbf{0}$ denotes a zero vector or a zero matrix and $0$ denotes a scalar; details of $A_{\theta_0}$ and $B_{\theta_0}$ are given in Appendix A.
Proof. First, note that $\ell(\theta; \cdot)$ is Lipschitz continuous in $\theta$. Moreover, the fact that $\ddot{\ell}_{\theta}$ is continuous in $\theta$ admits the Taylor expansion of $\ell(\theta; \cdot)$ around $\theta_0$:
$$\ell(\theta; \cdot) = \ell(\theta_0; \cdot) + (\theta - \theta_0)^{\top}\dot{\ell}_{\theta_0} + \tfrac{1}{2}(\theta - \theta_0)^{\top}\ddot{\ell}_{\theta_0}(\theta - \theta_0) + o\!\left(\|\theta - \theta_0\|^2\right).$$
Since $\hat{\theta}$ is the maximum likelihood estimate of $\theta$, $\dot{\ell}_n(\hat{\theta}) = \mathbf{0}$. Combined with the result from Theorem 1 that $\hat{\theta} \xrightarrow{p} \theta_0$, we conclude from Theorem 5.14 of van der Vaart [28] that
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) = A_{\theta_0}^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\dot{\ell}_{\theta_0}(y_i, x_i, z_i) + o_P(1),$$
which implies an asymptotic normal distribution with mean $\mathbf{0}$ and variance-covariance matrix $A_{\theta_0}^{-1} B_{\theta_0} A_{\theta_0}^{-1}$. □
For completeness, we include Theorem 5.14 of van der Vaart [28] (2000) as Theorem A2 in Appendix B. When the model is correctly specified, $A_{\theta_0} = B_{\theta_0}$, and the asymptotic variance is the inverse of the Fisher information. The matrices $A_{\theta_0}$ and $B_{\theta_0}$ are estimated by replacing $\theta_0$ with the MLE $\hat{\theta}$. Thus, for the correctly specified model, the variance-covariance matrix is estimated by the inverse of $A_{\hat{\theta}}$; for the misspecified model, it is estimated by $A_{\hat{\theta}}^{-1} B_{\hat{\theta}} A_{\hat{\theta}}^{-1}$. Let us define $\ddot{\ell}_n(\theta)$ as the second derivative of $\ell_n(\theta)$ with respect to $\theta$; then we can decompose it in the same way as $A_{\theta_0}$. Note that the sample average of $\ddot{\ell}$ is the empirical counterpart of $A_{\theta_0}$ and converges by the law of large numbers, so we use the estimated probability densities $\hat{f}_X$ and $\hat{f}_Z$ in place of $f_X$ and $f_Z$ for the entries involving the thresholds in $X$ and $Z$, respectively.
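A minimal numerical sketch of the sandwich (robust) variance estimator described above, assuming per-observation scores and Hessians of the log-likelihood are already available as arrays (how they are computed is left to the paper's Appendix A):

```python
import numpy as np

def sandwich_cov(scores, hessians):
    """Robust covariance A^{-1} B A^{-1} / n for an MLE, where
    scores has shape (n, p) and hessians has shape (n, p, p)."""
    n = scores.shape[0]
    A = -hessians.mean(axis=0)                                   # estimate of A = -E[Hessian]
    B = (scores[:, :, None] * scores[:, None, :]).mean(axis=0)   # estimate of B = E[score score^T]
    A_inv = np.linalg.inv(A)
    # Reduces to inv(A)/n (inverse Fisher information) when A = B,
    # i.e., under correct model specification.
    return A_inv @ B @ A_inv / n
```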
4. Simulation Studies
In this section, we evaluate the performance of the proposed model using simulated datasets. We consider two scenarios with the same sample size. We let the error terms $U$ and $V$ be jointly normally distributed with mean $(0, 0)^{\top}$, a common standard deviation $\sigma_u = \sigma_v$, and correlation $\rho$, and we simulate the instrumental variable $Z$ from a continuous distribution. The first scenario has one threshold in $X$ and one threshold in $Z$, i.e., model (2) and (3) with $J = K = 1$. The second scenario has two thresholds in $X$ and two thresholds in $Z$, i.e., $J = K = 2$. We show the simulated piecewise linear instrumental variable models for scenario 1 and scenario 2 in Figure 2. We replicate the simulation 1000 times to evaluate the finite sample properties of the proposed PLIV method.
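The Monte Carlo workflow can be summarized with the following sketch, which reuses pw_linear and neg_loglik from the earlier sketches. The true values, sample size, and initialization at the truth are illustrative simplifications, not the paper's settings (the paper initializes by 2SLS):

```python
import numpy as np
from scipy.optimize import minimize

truth = np.array([0., 1., -0.8, 0.5, 0., 1., 1.5, 0.])  # (b0,b1,b2,cx,a0,a1,a2,cz)
rng = np.random.default_rng(3)
est = []
for r in range(200):                          # 1000 replications in the paper
    n = 400
    z = rng.normal(size=n)
    v = rng.normal(scale=0.5, size=n)
    u = 0.5 * v + rng.normal(scale=0.5, size=n)
    x = pw_linear(z, 0., 1., [1.5], [0.]) + v
    y = pw_linear(x, 0., 1., [-0.8], [0.5]) + u
    theta0 = np.concatenate([truth, [np.log(0.5), np.log(0.5), 0.5]])
    fit = minimize(neg_loglik, theta0, args=(x, y, z), method="BFGS")
    est.append(fit.x[:8])
est = np.array(est)
print("bias:", est.mean(axis=0) - truth)      # bias per parameter
print("empirical SE:", est.std(axis=0, ddof=1))
```

Coverage probabilities would additionally use the variance estimators of Section 3.3 to form 95% confidence intervals in each replication.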
Table 1 summarizes the biases, the theoretical and empirical standard errors, and the coverage probabilities of the estimates obtained by the proposed PLIV method for scenario 1, where the theoretical standard error is computed from the asymptotic results and the empirical standard error from the Monte Carlo replications. As we can see in the table, all the biases are close to zero. We also find that the theoretical and the empirical standard errors are close, which confirms the validity of our theoretical results in Section 3. The results show that our model estimation is quite accurate and therefore provides unbiased and consistent estimators. Besides, we notice that the coverage probabilities are around 95% across the different settings. Moreover, the biases and the standard errors decrease as the instrument-strength coefficient increases, because the instrumental variable becomes stronger.
Table 2 summarizes the biases, the theoretical and empirical standard errors, and the 95% coverage probabilities of the estimates by the PLIV method for scenario 2. We find similar patterns as in Table 1 for scenario 1. For instance, all the biases are small, and the theoretical and empirical standard errors are close to each other. Most coverage probabilities are around 95%. We also observe that the coverage probabilities of the thresholds are slightly low when the correlation between the errors is high; with multiple thresholds and highly correlated errors, estimating the exact threshold locations is challenging.
We include results with a sample size of 1000 in Appendix C, with the other settings fixed. Overall, as $n$ increases, we observe that both the biases and the standard errors drop.
5. Application
In this section, we revisit Card's education data [5]. We apply the proposed model to study the causal effect of years of schooling on hourly wage (in cents), with the father's years of schooling as the instrumental variable. The interest here is to find a threshold and study the threshold effect of the years of schooling. It is generally believed that a child's years of schooling has a direct effect on the child's wage, whereas parents' education affects the child's income only by affecting the child's education level. In other words, parents' education level has no direct effect on the child's wage. Therefore, the father's years of schooling can be treated as a valid instrumental variable.
In Card's data, we remove the missing values. The explanatory variable $X$ (child's years of education) is between 1 and 18 with median 13, and the instrumental variable $Z$ (father's years of education) has minimum 0, maximum 18, and median 12. Figure 3 indicates that the variables $X$ and $Y$ are skewed and have heavy tails, so transformations are needed before the analysis; a log transformation is applied to both.
Table 3 shows the point estimates, standard errors, and associated 95% confidence intervals from the proposed model, with the numbers of thresholds selected by BIC: one threshold in the instrument and none in the explanatory variable. In the table, $\alpha$ and $c$ are the coefficient and threshold for the transformed father's years of schooling, respectively, and $\beta_1$ is the causal effect of years of schooling on earnings. The estimated causal effect of interest, $\hat{\beta}_1 = 0.87$, means that a one-unit increase in the log of years of schooling leads to a 0.87-unit increase in the log of wage. In economics, $\beta_1$ is interpreted as an "elasticity": if years of education increases by 1%, the person's income will increase by 0.87% by our estimation. In terms of the instrumental variable, the threshold $c$ is estimated to be 7.86. The corresponding p-value is not calculated since testing $c = 0$ is meaningless in this context. This shows that there exists a threshold at around 8 in the father's years of schooling; that is, the father's years of schooling only has a positive effect on the child's years of schooling if the father received at least 8 years of education. This information cannot be observed if the traditional 2SLS method or nonparametric approaches are applied to the data. The threshold effect and the remaining coefficients are all statistically significant since their corresponding p-values are far less than 0.05.
6. Discussion, Limitations, and Future Research
In this paper, we propose a simultaneous maximum likelihood estimation method for a piecewise linear instrumental variable model. We use the two-stage least squares estimators as initial values and the limited information maximum likelihood method to estimate the regression coefficients and the threshold parameters simultaneously. We also provide robust inference for the proposed model. The piecewise linear functions allow us to find thresholds for both the explanatory and the instrumental variables, which generalizes the traditional linear instrumental variable models. In the simulation study, we evaluate the performance of the proposed model and find that it behaves well in terms of biases, standard errors, and coverage probabilities across different settings.
Our model includes a single continuous explanatory variable and a single continuous instrumental variable. More complicated cases can be considered; for example, developing a piecewise linear model for count data might be interesting, although finding the optimal number of thresholds as well as their locations is theoretically challenging. Furthermore, we assume the numbers of thresholds $K$ and $J$ are prespecified. Treating the numbers of thresholds as random variables, finding their optimal values, and investigating the corresponding theoretical properties can be future research.