Next Article in Journal
Potential Involvement of LncRNAs in Cardiometabolic Diseases
Previous Article in Journal
The Complete Mitochondrial Genome and Gene Arrangement of the Enigmatic Scaphopod Pictodentalium vernedei
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Support Interval for Two-Sample Summary Data-Based Mendelian Randomization

Department of Biostatistics, University of Iowa, 145 N Riverside Dr., Iowa City, IA 52242, USA
Genes 2023, 14(1), 211; https://doi.org/10.3390/genes14010211
Submission received: 27 November 2022 / Revised: 31 December 2022 / Accepted: 4 January 2023 / Published: 13 January 2023
(This article belongs to the Topic Big Data in Healthcare, Bioinformatics and Precision Medicine)
(This article belongs to the Section Bioinformatics)

Abstract

:
The summary-data-based Mendelian randomization (SMR) method is gaining popularity in estimating the causal effect of an exposure on an outcome. In practice, the instrument SNP is often selected from the genome-wide association study (GWAS) on the exposure but no correction is made for such selection in downstream analysis, leading to a biased estimate of the effect size and invalid inference. We address this issue by using the likelihood derived from the sampling distribution of the estimated SNP effects in the exposure GWAS and the outcome GWAS. This likelihood takes into account how the instrument SNPs are selected. Since the effective sample size is 1, the asymptotic theory does not apply. We use a support for a profile likelihood as an interval estimate of the causal effect. Simulation studies indicate that this support has robust coverage while the confidence interval implied by the SMR method has lower-than-nominal coverage. Furthermore, the variance of the two-stage least squares estimate of the causal effect is shown to be the same as the variance used for SMR for one-sample data when there is no selection.

1. Introduction

A main interest in scientific research is to study the causal effect of an exposure x on an outcome y. When the outcome is continuous, the causal effect is the coefficient b in the following regression model:
y = b x + u + ϵ y ,
where u represents the unobserved factors and ϵ y is normally distributed with mean 0 and variance σ y 2 . Throughout this report, all variables are centered so that the intercept is equal to 0.
When u confounds the effect of x, the least squares estimate of b is biased. Mendelian randomization (MR) is a modern technique for correcting this bias [1,2,3,4,5], thanks to the availability of the large number of genome-wide association studies (GWASs). One appealing feature of the summary-data-based MR methods is that they don’t rely on individual-level data.
MR is an application of instrumental variable (IV) analysis to estimate b. IV analysis is able to control for unobserved confounders. MR uses single nucleotide polymorphisms (SNPs) as IVs. Let g denote the genotypic score of an SNP. For an IV to be valid, it must satisfy the following assumptions [6]:
Relevance: 
It is associated with the exposure x (i.e., C o v ( g , x ) 0 );
Exclusion Restriction: 
It affects the outcome y only through its association with the exposure; and
Exchangeability: 
It is not associated with any confounders of the exposure–outcome association, which implies C o v ( g , y ) = b C o v ( g , x ) .
Under these assumptions, Equation (1) implies:
b g y = b b g x ,
where b g y = C o v ( g , y ) / V a r ( g ) and b g x = C o v ( g , x ) / V a r ( g ) . Since b g x and b g y can be estimated from the exposure GWAS and the outcome GWAS, respectively, a popular summary-data MR (SMR) estimate of b is [2]:
b ^ SMR = b ^ g y b ^ g x ,
where b ^ g y and b ^ g x are GWAS estimates of b g y and b g x , respectively.
In practice, in order to satisfy the relevance assumption on an IV, an SNP is typically selected from the exposure GWAS, often at the genome-wide significance level p < 5 × 10 8 . Hence the selected SNPs are subject to a winner’s curse that leads to a biased effect estimate. This is an issue that has been recognized in the MR literature for some time [7,8,9,10,11,12,13]. A typical solution is to use another GWAS on the exposure to screen for IV SNPs [10,14,15]. However, such a GWAS may not always be available. A simple correction is to transform the false discovery rate to the z-scale [11]. An empirical study on the effect of the winner’s curse on a Mendelian randomization study is presented in [12]. A review of methods to overcome the winner’s curse in the context of genetic association studies is provided in [13].
As a matter of fact, many applications [2,16,17], including those contained in the original paper that proposed SMR [2], do not use another GWAS for screening. The IV SNPs are simply selected from the exposure GWAS and the selection bias is not corrected for in the downstream analysis [2,18,19]. Furthermore, this approach has been generalized to other settings [20,21].
This research is based on the sampling distribution of the estimated SNP effects on the exposure and on the outcome. Despite the large number of subjects used in the exposure GWAS and the outcome GWAS, there are only 4 summary statistics (i.e., two coefficient estimates and their respective standard errors) needed for MR at an IV SNP. The standard asymptotic theory, which requires a large sample size, does not apply since there is only one “observation” at an SNP. To sidestep this issue, we use a support derived from a profile likelihood as an interval estimate for b and assess its coverage probability through simulation. Support is the set of parameter values at which the log profile likelihood is a certain unit below the maximum log profile likelihood. It can be considered an extension of the confidence interval. Simulation studies demonstrate that the 2-unit support has robust coverage while the confidence interval implied by the SMR method has lower-than-nominal coverage.
In addition, we point out that the standard error of b ^ SMR derived from the delta method is the same as the standard error derived from the standard theory on two-stage least-squares (TSLS) regression for one-sample individual-level data in the absence of SNP selection.

2. Materials and Methods

2.1. One-Sample Individual-Level Data

In this subsection, we consider one-sample individual-level data where the IV SNP is not selected for its significant p-value. In this case, the delta method estimate of S E ( b ^ SMR ) is the same as the estimate derived from the theory on the TSLS method. The delta method is the method used by SMR [2]. This result indicates another connection between SMR and TSLS, in addition to the connection that b ^ SMR is the same as the TSLS estimate of b. James E. Pustejovsky proved this relationship in a blog post [22]. Below we provide a similar proof in our current context.
Consider the following GWAS model on exposure x:
x = b g x g + ϵ x ,
where ϵ x correlates with the unobserved confounder u. The reduced-form equation for y is:
y = b g y g + ( b ϵ x + u + ϵ y ) ,
where b g y = b b g x . Let x , y , and g denote the centered vectors of n observations on x, y, and g, respectively. Estimates of b g x and b g y can be obtained as follows: b ^ g x = g x / g g and b ^ g y = g y / g g . Let P = g g / g g . Their second-order moments are estimated by:
V a r ^ ( b ^ g x ) = n 1 x ( I P ) x g g , V a r ^ ( b ^ g y ) = n 1 y ( I P ) y g g , C o v ^ ( b ^ g x , b ^ g y ) = n 1 x ( I P ) y g g .
We note that C o v ^ ( b ^ g x , b ^ g y ) 0 .
The TSLS estimate of b is the least squares estimate of the coefficient in the regression where the response is y and the predictor is P x :
b ^ TSLS = x P y x P x = x g ( g g ) 1 g y x g ( g g ) 1 g x = ( g g ) 1 g y ( g g ) 1 g x = b ^ g y b ^ g x = b ^ SMR .
The delta method estimate of the variance of b ^ SMR is [2]:
V delta = 1 b ^ g x 2 V a r ( b ^ g y ) + b ^ TSLS 2 V a r ( b ^ g x ) 2 b ^ TSLS C o v ( b ^ g x , b ^ g y ) .
Since b ^ g x 2 g g = x P x , we have:
V delta = 1 b ^ g x 2 · 1 n g g y ( I P ) y + b ^ TSLS 2 x ( I P ) x 2 b ^ TSLS x ( I P ) y = n 1 ( y b ^ TSLS x ) ( y b ^ TSLS x ) x P x n 1 ( y b ^ TSLS x ) P ( y b ^ TSLS x ) x P x = n 1 ( y b ^ TSLS x ) ( y b ^ TSLS x ) x P x .
The last equal sign holds because:
P ( y b ^ TSLS x ) = g ( g g ) 1 g y g y g x · g ( g g ) 1 g x = 0 ,
where 0 is a vector of 0’s. The right-hand side of Equation (2) is exactly the estimated variance of b ^ TSLS defined in the standard theory on TSLS method [23]. Combining all these results, we have V delta = V TSLS .
V delta can not be computed from GWAS summary data as there is no information on C o v ( b ^ g x , b ^ g y ) . For the same reason, TSLS can also not be computed from GWAS summary data.
However, when b ^ g x and b ^ g y are derived from two independent samples, C o v ( b ^ g x , b ^ g y ) = 0 and V delta can be computed from GWAS summary data, as is shown in the SMR method [2]. SMR tests whether the exposure has a causal effect on the outcome using a statistic T SMR defined by:
T SMR = b ^ SMR 2 V delta ,
where b ^ SMR = b ^ g y 2 / b ^ g x 2 and
V delta = 1 b ^ g x 2 V a r ( b ^ g y ) + b ^ SMR 2 V a r ( b ^ g x ) .
On the other hand, the TSLS method is not defined for two-samples MR although there are some extensions [24].

2.2. Two Independent Samples with a Selected SNP

In this subsection, we consider two-sample MR where the IV SNP is selected from the exposure GWAS. The purpose of this selection is to ensure that the IV SNP is associated with the exposure. This practice is commonly used in empirical MR studies [2,16]. The selection criterion for an SNP is typically | b ^ g x | / S E ( b ^ g x ) τ for a prespecified τ . For the genome-wide significance level 5 × 10 8 , τ = 5.45131 .
The summary statistics used in an MR analysis are b ^ g x , S E ( b ^ g x ) , b ^ g y , and S E ( b ^ g y ) . To simplify notations, they will be denoted by x, σ x , y, and σ y , respectively, and b g x and b g y will be denoted by μ x and μ y , respectively. We ignore the sampling variation in σ x and σ y as they typically are derived from GWASs of very large sample sizes. A similar assumption is made elsewhere, for instance, [25]. In these notations, x and y have the following sampling distributions, respectively:
x C N ( μ x , σ x 2 ) , y N ( μ y , σ y 2 ) ,
where C N ( · , · ) stands for a conditional normal given | x / σ x | τ and N ( · , · ) a normal distribution. The distribution function C N ( · , · ) was used to construct an approximate conditional likelihood for estimating μ x [26]. The term “approximate” comes from the fact that the distributions of x (prior to selection) and y are approximately normal.
Let α 1 = τ μ x / σ x , α 2 = τ μ x / σ x , and
A = Pr ( x / σ x τ ) + Pr ( x / σ x τ ) = 1 Φ ( α 2 ) + Φ ( α 1 ) ,
where Φ ( · ) is the cumulative distribution function of standard normal. The density function of x is:
1 A σ x ϕ x μ x σ x ,
where ϕ ( · ) is the density function of standard normal. The expected value of x is:
μ x + σ x A ϕ α 2 ϕ α 1
and its variance is:
σ ˜ x 2 = σ x 2 1 + α 2 ϕ ( α 2 ) α 1 ϕ ( α 1 ) A [ ϕ ( α 2 ) ϕ ( α 1 ) ] 2 A 2 .
Note that σ ˜ x 2 is no longer constant; its value depends on μ x . When there is no selection, τ = 0 and α 1 = α 2 . The mean and variance reduce to μ x and σ x 2 , respectively.
Since x is independent of y, the likelihood function L ( μ x , μ y ) is:
L ( μ x , μ y ) = L x ( μ x ) L y ( μ y ) ,
where L x ( μ x ) and L y ( μ y ) are the likelihood functions based on x and y, respectively.
The likelihood function L y ( μ y ) is:
L y ( μ y ) = 1 σ y ϕ y μ y σ y .
The MLE of μ y is apparently μ ^ y = y .
The likelihood function L x ( μ x ) is:
L x ( μ x ) = 1 A σ x ϕ x μ x σ x .
Its score equation is:
x = μ x + σ x A ϕ ( α 2 ) ϕ ( α 1 ) .
This equation determines the MLE μ ^ x for μ x . However, there is no explicit form for μ ^ x . The MLE μ ^ x can be obtained by maximizing L x ( μ x ) numerically.
From Equations (4) and (5), x is an unbiased estimate of the mean of the conditional normal distribution for x. However it is biased for μ x as the mean shown in Equation (4) is a nonlinear function of μ x . Similar comments are made elsewhere [26].
Since σ x > 0 and A > 0 , Equation (5) indicates that when x > 0 μ x must be positive. Otherwise x μ x would be positive and ϕ ( α 2 ) ϕ ( α 1 ) is negative (because α 1 is closer to 0 than α 2 is). There would be a contradiction. Because μ x > 0 implies that the second term of Equation (5) is positive, there is x > μ ^ x > 0 . Following the same logic, when x < 0 , there is x < μ ^ x < 0 . In either case, the naïve Wald ratio y / x underestimates b = μ y / μ x . The MLE of b, denoted by b ^ , is b ^ = y / μ ^ x . b ^ is biased. The expectation of b ^ is E ( y ) E ( 1 / μ ^ x ) b because E ( 1 / μ ^ x ) 1 / μ x .
Figure 1 shows the μ ^ x / σ x as a function of x / σ x . The larger the value of x / σ x , the smaller the absolute difference | x / σ x μ ^ x / σ x | . When | x / σ x | 7.5 (corresponding to a p-value less than or equal to 6.38× 10 14 ), the absolute difference | x / σ x μ ^ x / σ x | is < 0.05 and seems to be negligible.
Under the null:
H 0 : b = μ y / μ x = 0 , μ x 0 ,
there is L ( μ x , μ y ) = L x ( μ x ) L y ( 0 ) . The MLE of μ x is equal to μ ^ x determined by Equation (5). Under the alternative:
H 1 : b = μ y / μ x 0 , μ x 0 ,
the MLE of μ y is y and the MLE of μ x is still μ ^ x . The likelihood ratio statistic for testing H 0 against H 1 is:
T = 2 log L ( μ ^ x , y ) L ( μ ^ x , 0 ) = 2 log L y ( y ) L y ( 0 ) = y 2 σ y 2 χ 1 2 .
This is the “conditional test” we proposed previously [9]. It is more powerful than the SMR statistic shown in Equation (3).
The SMR statistic T SMR shown in Equation (3) does not taking into account the effect of the selection of the IV SNP on the inference. The variance of x is no longer σ x 2 because x is selected. Even if σ x 2 is replaced by the variance σ ˜ x 2 of C N ( μ x , σ x 2 ) , the resulting statistic is less powerful than the T statistic: Replacing b ^ SMR by y / μ ^ x and σ x by σ ˜ x in Equation (3), a modification of the SMR statistic would be:
T ˜ SMR = y 2 / μ ^ x 2 ( σ y 2 + y 2 / μ ^ x 2 · σ ˜ x 2 ) / μ ^ x 2 = y 2 σ y 2 + σ ˜ x 2 · y 2 / μ ^ x 2 < y 2 σ y 2 = T .
That is, T ˜ SMR is less powerful than T; p-values for both statistics are calculated from the same distribution, which is chi-square with 1 df. The larger test statistic corresponds to the smaller p-value.

2.3. Support of Profile Likelihood

We now turn to an interval estimate for b = μ y / μ x . Such an estimate is not trivial since the asymptotic theory is irrelevant as there is effectively only one observation in L ( μ x , u y ) . For this reason, the distribution of b ^ = y / μ ^ x , which is the MLE of b = μ y / μ x , is far from normal. To demonstrate this point, the following simulation study is conducted.
We generate 100,000 x’s from a normal distribution with μ x = 4 and σ x 2 = 1 (so that there are a reasonable amount of x’s to be selected), 7.412 of them satisfy | x | > 5.45131 and are selected. The same number (i.e., 7.412) of y’s are generated independently from a normal distribution with mean μ y = b μ x and variance σ y 2 = 1 . For each ( x , y ) pair, b ^ = y / μ ^ x is calculated. Histograms of b ^ for b = 0 and b = 2 are shown in Figure 2. For b = 0 , the mean of b ^ is 0.0073 and the median is 0.0022. The distribution has a high probability in the neighborhood of 0. For b = 2 , the distribution of b ^ seems to be bimodal and is skewed to the right with a mean equal to 7.4755 and a median equal to 2.5700. Both values are larger than the true value b = 2 . These means and medians are also shown in Table 1 together with results from another simulation study described later.
We consider the profile log-likelihood function p l ( b ) defined by:
p l ( b ) = max μ x [ log L x ( μ x ) + log L y ( b μ x ) ] .
This function is maximized at b ^ = y / μ ^ x and the maximum is equal to p l ( b ^ ) = log L ( μ ^ x , y ) = log L x ( μ ^ x ) + log L y ( y ) , which is also the maximum of log L ( μ x , μ y ) .
A natural interval estimate would be a 1 α profile confidence interval defined as the set of b 0 such that H 0 : b = b 0 is not rejected at significance level α . However, the distribution of the log profile likelihood ratio
2 p l ( b ^ ) p l ( b 0 ) = 2 log L ( μ ^ x , y ) p l ( b 0 )
is unknown for an arbitrary b 0 . The only exception is b 0 = 0 at which
2 log L ( μ ^ x , y ) p l ( 0 ) = 2 log L x ( μ ^ x ) L y ( y ) log L x ( μ ^ x ) L y ( 0 ) = 2 log L y ( y ) log L y ( 0 ) = y 2 σ y 2 χ 1 2 .
Intuitively, the log partial likelihood function p l ( b ) can not be approximated by a quadratic function in the vicinity of b ^ when b 0 0 . As a result, the profile confidence interval for b can not be constructed. An example log partial likelihood function p l ( b ) is shown in Figure 3 for x / σ x = 5.4599 and y / σ y = 12.3155 .
For an interval estimate of b, we use the k-unit support defined by [27]:
b 0 : p l ( b ^ ) p l ( b 0 ) < k = b 0 : p l ( b 0 ) > log L ( μ ^ x , y ) k ,
where k is a prespecified number. This interval consists of b 0 for which p l ( b 0 ) is greater than log L ( μ ^ x , y ) k . It can be regarded as a generalization of the usual confidence interval. For instance, when b 0 = 0 and k = 2 ,
0.95 = Pr ( 2 [ log L ( μ ^ x , y ) p l ( 0 ) ] < 3.84) = Pr ( log L ( μ ^ x , y ) p l ( 0 ) < 1.92) Pr ( log L ( μ ^ x , y ) p l ( 0 ) < 2 ) .
This approximation worsens as | b 0 | moves further away from 0. For the example shown in Figure 3 (i.e., x / σ x = 5.4599 and y / σ y = 12.3155 ), the lower limit of the 2-unit support is 2.146 and the upper limit is greater than 43.406. The exact value of the upper limit is unknown due to numerical issues. It may be unbounded.
By the way the support is constructed, the null H 0 : b 0 = 0 is rejected by the statistic T at significance level α if and only if the k-unit support, where k = [ Φ 1 ( 1 α / 2 ) ] 2 / 2 , contains 0.
We use a simulation study to investigate the coverage of a 2-unit support. For this purpose, data are generated as before but more values for b, i.e., b = 0 , 0.5, 1, 1.5, and 2, are considered. For each value of b, we compare the winner’s-0curse-corrected method and the SMR method in terms of a point estimate of b, an interval estimate of b, and a test of H 0 : b = 0 . Results are reported in Table 1. Both the winner’s-curse-corrected method and SMR method are biased in terms of the mean and median. The SMR 95% confidence interval, computed as b ^ SMR ± 1.96× V delta 1 / 2 , has worse coverage as b increases while the 2-unit support has rather stable coverage. In addition, the test statistic T is more powerful than the SMR method.

3. An Empirical Data Analysis

We conducted a Mendelian randomization analysis of the effect of age of menarche on total pubertal height growth and late pubertal height growth using the winner’s-curse-corrected method and the SMR method. Previously, we used the inverse-variance weighted (IVW) method [5] and the MR-Egger regression method [6] on these exposures and outcomes [15]. In that study, to avoid the winner’s curse caused by the selection of IV SNPs, two other GWAS studies on age at menarche from an MR-Base database were used for validation. IV SNPs were significant in the main GWAS for age at menarche but not in the other two other GWASs which were removed. Such a procedure helps to avoid IV SNPs that are close to the selection threshold. In this study, we use all significant IV SNPs without further validation.
GWAS summary data were retrieved from the MR-Base database (http://www.mrbase.org/ accessed on 27 November 2022). At the genome-wide significance level 5 × 10 8 , 117 instrument SNPs were selected from a previous study on age at menarche with 182,413 females of European ancestry [28]. After pruning for linkage disequilibrium, there are 84 SNPs left. The GWAS summary statistics on adult height were obtained from a study with 4946 females of European ancestry [29]. Thus, the population of this study matches that of the study on age at menarche.
For each SNP, the winner’s-curse-corrected estimate of b and a support are computed in addition to the SMR estimate and the associated confidence interval. To correct for the 84 IV SNPs, the support is 5.9-unit since Pr ( χ 1 2 > 2 × 5.9) = 0.05/ 84 and the nominal coverage of the confidence interval is 0.9994( = 1 0.05/ 84 ) . As discussed previously, this support excludes 0 if and only if the T statistic is significant at the level 0.05/ 84 . The p-value for the winner’s-curse-corrected method is based on the T statistic. SNPs whose supports or confidence intervals do not contain 0 are shown in Table 2.
The estimates of b from the winner’s-curse-corrected method and the SMR method are pretty close to each other for the SNPs shown in Table 2, as are the support and the confidence interval. This is due to the high significance of the association of these SNPs with the age at menarche (p-values: 4.552× 10 15 for rs7514705 and rs7642134; < 4.552× 10 15 for rs7759938). For both total and late pubertal height growth, the T statistic is more significant than the T SMR statistic. For late pubertal height growth, SNP rs7514705 is significant for the T statistic but not for the T SMR statistic.
Another empirical application on the conditional test T is the study of schizophrenia, which was shown in our previous publication [9]. The T statistic identified some strong candidate genes (e.g., AKT3, RGS6, and KCNN3) for schizophrenia that are missed by the SMR method.

4. Discussion

Previously, we proposed a test statistic T for testing H 0 : b = 0 [15]. The current work extends the previous work by focusing on the point and interval estimate of the causal effect b. Because the “sample size” for the MR analysis is 1, the standard likelihood theory does not apply. As a result, it is not straightforward to construct a confidence interval.
We considered two extreme scenarios: one being the one-sample individual-level data and the other being independent-sample summary data. In addition, on of these scenarios is without the winner’s curse caused by the selection of IV SNPs and the other suffers from the winner’s curse. For one-sample individual-level data that is free of the winner’s curse, the SMR method is the same as the TSLS method, not only in terms of the estimates of the causal effect size but also in terms of the variance of the estimates. For two independent-sample summary data with a selected SNP, the SMR test for H 0 : b = 0 is less powerful than the conditional test we proposed earlier [9]. Confidence intervals derived from the SMR method have poor coverage compared to their nominal levels. In comparison, the supports we proposed have stable coverage, at least in our simulation studies.
There are reports (also see our empirical data analyses) showing that the winner’s curse may not have substantial impact on the MR estimates [12]. This is because in these cases the SNPs are strong. As indicated by Figure 1, the winner’s curse affects the relatively weak IV SNPs most. These are the SNPs with | b ^ g x / S E ( b ^ g x ) | < 7.5 (i.e., p < 6.38× 10 14 ). For the strong SNPs, there is not much difference between x and its maximum likelihood estimate. An SNP is strong when either the effect size b or the sample size in the exposure GWAS is, or both, are large. The three-sample design [14] also helps in making an SNP strong by increasing the chance that the selected SNPs are highly significant. Theoretically, however, it does not eliminate the winner’s curse as the probability that a weak SNP is significant in the discovery GWAS and the exposure GWAS is non-zero.
In the previous paragraph, the meaning of the term “weak” may be different than weak instrument in the usual sense although there is no universally-accepted definition of weak instrument. It is relative to the threshold for selecting IV SNPs. SNPs that barely pass the threshold are always weak. In comparison, a weak instrument in the usual sense seems to be characterized in absolute sense, for instance, the F-statistic for testing H 0 : b g x = 0 is less than 10 [30].
Although Equation (1) is on continuous traits, the proposed winner’s-curse-corrected method works for dichotomous traits because it is based on the approximate normality on b ^ g x and b ^ g y .
The current study focuses on a single SNP analysis. A major advantage of such an analysis over multiple SNPs such as the IVW method and the MR-Egger regression method is that it involves less assumptions. For example, the causal effects at different SNPs are allowed to be different. An interesting topic would be to generalize the current work to the case of using multiple SNPs simultaneously.
Our winner’s-curse-corrected method is designed for two independent (i.e., non-overlapping) samples only. This is a limitation although it is not uncommon for methodology development, for instance, [14]. In practice, the study subjects for the exposure GWAS and the outcome GWAS may overlap [12]. The likelihood function L ( μ x , μ y ) will be different than what is presented here. The conditional test T needs to be revised and the concept of support is still applicable. Future research on this topic is warranted.
The winner’s-curse-corrected method has been implemented in the R package iGasso.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

GWAS summary data used in this research were retrieved from the MR-Base database (http://www.mrbase.org/ accessed on 27 November 2022).

Acknowledgments

The author would like to thank two anonymous reviewers for their helpful comments.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GWASgenome-wide association study
IVinstrumental variable
MRMendelian randomization
SNPsingle nucleotide polymorphism
TSLStwo-stage least-squares

References

  1. Hemani, G.; Tilling, K.; Smith, G.D. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 2017, 13, e1007081. [Google Scholar]
  2. Zhu, Z.; Zhang, F.; Hu, H.; Bakshi, A.; Robinson, M.R.; Powell, J.E.; Montgomery, G.W.; Goddard, M.E.; Wray, N.R.; Visscher, P.M.; et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016, 48, 481. [Google Scholar] [CrossRef]
  3. Morrison, J.; Knoblauch, N.; Marcus, J.H.; Stephens, M.; He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 2020, 52, 740–747. [Google Scholar] [CrossRef]
  4. Davey Smith, G.; Ebrahim, S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? Int. J. Epidemiol. 2003, 32, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Burgess, S.; Butterworth, A.; Thompson, S.G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 2013, 37, 658–665. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Bowden, J.; Davey Smith, G.; Burgess, S. Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 2015, 44, 512–525. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Hemani, G.; Zheng, J.; Elsworth, B.; Wade, K.H.; Haberland, V.; Baird, D.; Laurin, C.; Burgess, S.; Bowden, J.; Langdon, R.; et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife 2018, 7, e34408. [Google Scholar] [CrossRef]
  8. Zhao, Q.; Wang, J.; Hemani, G.; Bowden, J.; Small, D.S. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Ann. Stat. 2020, 48, 1742–1769. [Google Scholar] [CrossRef]
  9. Wang, K.; Han, S. Effect of selection bias on two sample summary data based Mendelian randomization. Sci. Rep. 2021, 11, 7585. [Google Scholar] [CrossRef]
  10. Ye, T.; Shao, J.; Kang, H. Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization. Ann. Stat. 2021, 49, 2079–2100. [Google Scholar] [CrossRef]
  11. Bigdeli, T.B.; Lee, D.; Webb, B.T.; Riley, B.P.; Vladimirov, V.I.; Fanous, A.H.; Kendler, K.S.; Bacanu, S.A. A simple yet accurate correction for winner’s curse can predict signals discovered in much larger genome scans. Bioinformatics 2016, 32, 2598–2603. [Google Scholar] [CrossRef]
  12. Jiang, T.; Gill, D.; Butterworth, A.S.; Burgess, S. An empirical investigation into the impact of winner’s curse on estimates from Mendelian randomization. medRxiv 2022. [Google Scholar] [CrossRef]
  13. Forde, A.; Hemani, G.; Ferguson, J. Review and further developments in statistical corrections for Winner’s Curse in genetic association studies. bioRxiv 2022. [Google Scholar] [CrossRef]
  14. Zhao, Q.; Chen, Y.; Wang, J.; Small, D.S. Powerful three-sample genome-wide design and robust statistical inference in summary-data Mendelian randomization. Int. J. Epidemiol. 2019, 48, 1478–1492. [Google Scholar] [CrossRef]
  15. Jo, E.J.; Han, S.; Wang, K. Estimation of Causal Effect of Age at Menarche on Pubertal Height Growth Using Mendelian Randomization. Genes 2022, 13, 710. [Google Scholar] [CrossRef]
  16. Hannon, E.; Gorrie-Stone, T.J.; Smart, M.C.; Burrage, J.; Hughes, A.; Bao, Y.; Kumari, M.; Schalkwyk, L.C.; Mill, J. Leveraging DNA-methylation quantitative-trait loci to characterize the relationship between methylomic variation, gene expression, and complex traits. Am. J. Hum. Genet. 2018, 103, 654–665. [Google Scholar] [CrossRef] [Green Version]
  17. Lee, B.; Yao, X.; Shen, L. Integrative analysis of summary data from GWAS and eQTL studies implicates genes differentially expressed in Alzheimer’s disease. BMC Genom. 2022, 23, 414. [Google Scholar] [CrossRef]
  18. Porcu, E.; Rüeger, S.; Lepik, K.; Santoni, F.A.; Reymond, A.; Kutalik, Z. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nat. Commun. 2019, 10, 3300. [Google Scholar] [CrossRef] [Green Version]
  19. Porcu, E.; Sadler, M.C.; Lepik, K.; Auwerx, C.; Wood, A.R.; Weihs, A.; Sleiman, M.S.B.; Ribeiro, D.M.; Bandinelli, S.; Tanaka, T.; et al. Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome. Nat. Commun. 2021, 12, 5647. [Google Scholar] [CrossRef]
  20. Zhu, Z.; Zheng, Z.; Zhang, F.; Wu, Y.; Trzaskowski, M.; Maier, R.; Robinson, M.R.; McGrath, J.J.; Visscher, P.M.; Wray, N.R.; et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 2018, 9, 224. [Google Scholar] [CrossRef] [Green Version]
  21. Jin, C.; Lee, B.; Shen, L.; Long, Q.; Alzheimer’s Disease Neuroimaging Initiative; Alzheimer’s Disease Metabolomics Consortium. Integrating multi-omics summary data using a Mendelian randomization framework. Brief. Bioinform. 2022, 23, bbac376. [Google Scholar] [CrossRef] [PubMed]
  22. Pustejovsky, J.E. 2SLS Standard Errors and the Delta-Method. 2017. Available online: https://www.jepusto.com/delta-method-and-2sls-ses/ (accessed on 11 November 2022).
  23. Greene, W.H. Econometric Analysis, 6th ed.; Pearson-Prentice Hall: New York, NY, USA, 2008. [Google Scholar]
  24. Zhao, Q.; Wang, J.; Spiller, W.; Bowden, J.; Small, D.S. Two-sample instrumental variable analyses using heterogeneous samples. Stat. Sci. 2019, 34, 317–333. [Google Scholar] [CrossRef] [Green Version]
  25. Burgess, S.; Dudbridge, F.; Thompson, S.G. Combining information on multiple instrumental variables in Mendelian randomization: Comparison of allele score and summarized data methods. Stat. Med. 2016, 35, 1880–1906. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Ghosh, A.; Zou, F.; Wright, F.A. Estimating odds ratios in genome scans: An approximate conditional likelihood approach. Am. J. Hum. Genet. 2008, 82, 1064–1074. [Google Scholar] [CrossRef] [Green Version]
  27. Edwards, A.W.F. Likelihood; CUP Archive: New York, NY, USA, 1984. [Google Scholar]
  28. Perry, J.R.; Day, F.; Elks, C.E.; Sulem, P.; Thompson, D.J.; Ferreira, T.; He, C.; Chasman, D.I.; Esko, T.; Thorleifsson, G.; et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature 2014, 514, 92–97. [Google Scholar] [CrossRef] [Green Version]
  29. Cousminer, D.L.; Berry, D.J.; Timpson, N.J.; Ang, W.; Thiering, E.; Byrne, E.M.; Taal, H.R.; Huikari, V.; Bradfield, J.P.; Kerkhof, M.; et al. Genome-wide association and longitudinal analyses reveal genetic loci linking pubertal height growth, pubertal timing and childhood adiposity. Hum. Mol. Genet. 2013, 22, 2735–2747. [Google Scholar] [CrossRef]
  30. Staiger, D.O.; Stock, J.H. Instrumental Variables Regression with Weak Instruments; Cowles Foundation Discussion Papers: London, UK, 1994. [Google Scholar]
Figure 1. Plot of μ ^ x / σ x against x / σ x selected under | x / σ x | > 5.45131 (corresponding to p < 5 × 10 8 ). The vertical line is at | x / σ x | = 7.5 , which corresponds to a p-value of 6.38× 10 14 . The part corresponding to x / σ x < 0 is not shown since μ ^ x / σ x is an odd function of x / σ x .
Figure 1. Plot of μ ^ x / σ x against x / σ x selected under | x / σ x | > 5.45131 (corresponding to p < 5 × 10 8 ). The vertical line is at | x / σ x | = 7.5 , which corresponds to a p-value of 6.38× 10 14 . The part corresponding to x / σ x < 0 is not shown since μ ^ x / σ x is an odd function of x / σ x .
Genes 14 00211 g001
Figure 2. Histogram of simulated b ^ = y / μ ^ x , the MLE of b. Data simulation procedure is described in the text.
Figure 2. Histogram of simulated b ^ = y / μ ^ x , the MLE of b. Data simulation procedure is described in the text.
Genes 14 00211 g002
Figure 3. Profile likelihood for x / σ x = 5.4599 and y / σ y = 12.3155 . The MLE of b is b ^ = 33.416 . The lower limit of the 2-unit support is 2.146 and the upper limit is greater than 43.406. The exact value of the upper limit is unknown due to numerical issues. It may be unbounded.
Figure 3. Profile likelihood for x / σ x = 5.4599 and y / σ y = 12.3155 . The MLE of b is b ^ = 33.416 . The lower limit of the 2-unit support is 2.146 and the upper limit is greater than 43.406. The exact value of the upper limit is unknown due to numerical issues. It may be unbounded.
Genes 14 00211 g003
Table 1. Results of simulation studies with μ x = 4 and σ x = σ y = 1 . The statistic T is T = y 2 / σ y 2 .
Table 1. Results of simulation studies with μ x = 4 and σ x = σ y = 1 . The statistic T is T = y 2 / σ y 2 .
b
Method00.511.52
Winner’s-curse-corrected
    Mean of b ^ 0.00731.87433.74145.60847.4755
    Median of b ^ 0.00220.68431.30911.93272.5700
    Coverage of 2-unit support0.95870.97250.98030.98160.9811
    Power of T for testing H 0 : b = 0 0.04710.52170.98071.00001.0000
SMR
    Mean of b ^ SMR 0.00190.34240.68291.02341.3639
    Median of b ^ SMR −0.33100.34050.67951.01991.3615
    Coverage of 95% CI0.96480.85240.65110.49660.3958
    Power for testing H 0 : b = 0 0.03530.47210.97261.00001.0000
Table 2. Results for the effects of age at menarche on total pubertal height growth and late pubertal height growth. To correct for the 84 IV SNPs, the support is 5.9-unit and the nominal coverage of the CI is 0.9994( = 1 0.05/ 84 ) . This support excludes 0 if and only if the T statistic is significant at the level 0.05/ 84 . The p-value is for the null H 0 : b = 0 . It is computed from the T statistic (the winner’s-curse-corrected method) or the T SMR statistic (the SMR method).
Table 2. Results for the effects of age at menarche on total pubertal height growth and late pubertal height growth. To correct for the 84 IV SNPs, the support is 5.9-unit and the nominal coverage of the CI is 0.9994( = 1 0.05/ 84 ) . This support excludes 0 if and only if the T statistic is significant at the level 0.05/ 84 . The p-value is for the null H 0 : b = 0 . It is computed from the T statistic (the winner’s-curse-corrected method) or the T SMR statistic (the SMR method).
Winner’s-Curse-Corrected Method
SNPGene Name b ^ (5.9-Unit Support)p-Value
Total pubertal height growth
rs7514705TNNI3K2.048 (0.889, 3.807) 8.856× 10 6
rs7642134POU1F12.474 (1.264, 4.433) 1.117× 10 7
Late pubertal height growth
rs7514705TNNI3K1.822 (0.057, 5.091) 5.024× 10 4
rs7759938LIN28B0.931 (0.335, 1.571) 2.756× 10 7
SMR Method
SNPGene Name b ^ SMR (99.94% CI)p-Value
Total pubertal height growth
rs7514705TNNI3K2.042 (0.330, 3.754) 1.108× 10 4
rs7642134POU1F12.466 (0.647, 4.284) 1.110× 10 5
Late pubertal height growth
rs7759938LIN28B0.931 (0.330, 1.533) 5.142× 10 7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, K. Support Interval for Two-Sample Summary Data-Based Mendelian Randomization. Genes 2023, 14, 211. https://doi.org/10.3390/genes14010211

AMA Style

Wang K. Support Interval for Two-Sample Summary Data-Based Mendelian Randomization. Genes. 2023; 14(1):211. https://doi.org/10.3390/genes14010211

Chicago/Turabian Style

Wang, Kai. 2023. "Support Interval for Two-Sample Summary Data-Based Mendelian Randomization" Genes 14, no. 1: 211. https://doi.org/10.3390/genes14010211

APA Style

Wang, K. (2023). Support Interval for Two-Sample Summary Data-Based Mendelian Randomization. Genes, 14(1), 211. https://doi.org/10.3390/genes14010211

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop