Next Article in Journal
Automorphisms and Definability (of Reducts) for Upward Complete Structures
Next Article in Special Issue
Point Processes in a Metric Space and Their Applications
Previous Article in Journal
Analysis of Multi-Server Priority Queueing System with Hysteresis Strategy of Server Reservation and Retrials
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia

1
Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
2
Departments of Global Pediatric Medicine and Hematology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
3
Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
4
Department of Pediatrics, Emory University School of Medicine, Children’s Healthcare of Atlanta, Atlanta, GA 30322, USA
5
Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
6
Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
7
Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
8
Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
9
MRC Integrative Epidemiology Unit, University of Bristol, Bristol BS8 2BN, UK
10
Department of Public Health and Primary Care, University of Cambridge, Cambridge CB2 0SR, UK
11
Department of Pediatrics, Vanderbilt-Meharry Sickle Cell Disease Center of Excellence, Vanderbilt University Medical Center, Nashville, TN 37232, USA
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2022, 10(20), 3743; https://doi.org/10.3390/math10203743
Submission received: 18 August 2022 / Revised: 3 October 2022 / Accepted: 9 October 2022 / Published: 12 October 2022

Abstract

:
Mendelian randomization (MR) is increasingly employed as a technique to assess the causation of a risk factor on an outcome using observational data. The two-stage least-squares (2SLS) procedure is commonly used to examine the causation using genetic variants as the instrument variables. The validity of 2SLS relies on a representative sample randomly selected from a study cohort or a population for genome-wide association study (GWAS), which is not always true in practice. For example, the extreme phenotype sequencing (EPS) design is widely used to investigate genetic determinants of an outcome in GWAS as it bears many advantages such as efficiency, low sequencing or genotyping cost, and large power in detecting the involvement of rare genetic variants in disease etiology. In this paper, we develop a novel, versatile, and efficient approach, namely MR analysis under Extreme or random Phenotype Sampling (MREPS), for one-sample MR analysis based on samples drawn through either the random sampling design or the nonrandom EPS design. In simulations, MREPS provides unbiased estimates for causal effects, correct type I errors for causal effect testing. Furthermore, it is robust under different study designs and has high power. These results demonstrate the superiority of MREPS over the widely used standard 2SLS approach. We applied MREPS to assess and highlight the causal effect of total fetal hemoglobin on anemia risk in patients with sickle cell anemia using two independent cohort studies. A user-friendly Shiny app web interface was implemented for professionals to easily explore the MREPS.

1. Introduction

Historically, making causal inferences on the effects of exposures or risk factors on important health outcomes in humans required executing a controlled experiment, but this approach is limited due to the necessity of exposure of subjects to risk factors, as well as other logistical considerations. Instead, researchers rely on observational studies of individuals with a history of exposure to a risk factor(s) of interest. Although making causal inferences is crucial to better understand the pathophysiology of a disease, conclusions on causality from observational studies are difficult to establish due to confounding and reverse causation effects [1,2,3]. This limitation in observational studies can be partly overcome by Mendelian randomization (MR) analyses, by utilizing genetic variants as instrumental variables (IVs) for the risk factor. Utilizing genetic variants as IVs facilitates analyzing them as natural experiments within observational cohorts, allowing for causal inference between outcomes and risk factors. With this advantage, MR has gained immense popularity as an epidemiologic methodology for evaluating causal inferences from observational studies.
The advent of genome-wide association studies (GWAS), either based on genotyping array or next generation sequencing (NGS), has been pivotal in better understanding the epidemiology and pathophysiology of many human diseases. Additionally, cohorts with existing genotype/phenotype data support exploration of causal influence of genetic variants in disease onset and severity leveraging MR methodologies. However, performing genomic sequencing of observational cohorts, frequently composed of a large number of participants, is daunting due to cost and many other reasons such as secondary findings and privacy loss. As a solution to these drawbacks, the employment of extreme phenotyping sequencing (EPS) design has been utilized with GWAS data. For more detail, see [4,5,6,7,8,9,10,11,12,13]. For EPS design, individuals are identified from the extreme quantiles of a selected phenotype or trait value distribution. The EPS design bears advantages of cost effectiveness, higher power in detecting rare variants, and efficiency, but, by only evaluating the extremes of phenotype distribution, EPS violates random sampling, a basic tenet of statistical methodology aimed at preventing bias. With this limitation, the EPS design has been largely constrained to typical case-control analyses.
The set-valued approach proposed by [14] with the maximum likelihood framework has unlocked the potential of EPS design by turning it into a procedure that can provide unbiased estimates even under the nonrandom extreme tails sampling process. This approach can produce versatile statistical procedures that can be universally applied to both nonrandom EPS and random sampling by turning the random sampling into a special case of EPS [15]. Because of this success, the set-valued approach has been utilized by [16] in investigating the association of genetic variants of secondary traits under the EPS design. By utilizing the framework suggested by [16], along with the set-valued approach, [15] have proposed a versatile mediation analysis approach, which can produce unbiased estimates under both EPS and random sampling for investigating the mediation effect of genotype through a mediator on a continuous phenotype. These studies show the utilization of the set-valued approach under the EPS design has the capacity to produce unbiased results similar to results obtained under random sampling. In this study, we report the capacity of the combination of EPS design and the set-valued approach in analyzing the causal association of endogenous risk factors on health outcomes by integrating the concepts developed in MR.
Under MR, the concepts and methods of IV and two-stage least-squares (2SLS), originally developed in econometrics [17], are utilized to explore the causal relationships between risk factors and health outcomes. The 2SLS procedure fits two regression models, 1st and 2nd stage regressions, on collected data. In order to obtain the most unbiased estimates of the model parameters via these regressions, it is essential to feed data drawn from a random sampling framework. A direct application of this conventional 2SLS under EPS design is inappropriate, as EPS severally violates the random sampling, which is the essential bias reducing procedure for statistical methods. This constraint limits the further exploration of the causal effects between risk factors and health outcomes by utilizing the genetic variants collected under already executed EPS design study, limiting the further extraction of the important information of the disease etiology. If exploring the causal relationship is also a priority, it is not necessary to redesign another random observational study and carry out another sequencing process, increasing the cost and sacrificing the efficiency and statistical power. However, if the 2SLS approach is versatile enough for handling EPS data, then we can explore all this information from already available EPS design without redesigning a new separate study, turning the EPS into an even more efficient study design that could address multiple statistical questions simultaneously by deploying multiple statistical methodologies under only a single EPS design. These plausible benefits, along with the surge of the EPS and MR designs, demand a new, versatile approach in exploring the causal relationship between risk factors and health outcomes, regardless of if the samples are drawn randomly or not.
Motivated by these potential benefits that the EPS design could bring, we developed novel MR estimation and testing procedures for the causal effect of the risk factor under the maximum likelihood framework by utilizing the limited information maximum likelihood (LIML) [18,19,20], which is the counterpart maximum likelihood approach of the 2SLS. We called this approach MR analysis under Extreme or random Phenotype Sampling (MREPS). Due to the capacity of producing unbiased estimates under the EPS design as well as turning the random sampling into a special case of EPS design, which is an essential characteristic for considering in the development of versatile approaches for both random and EPS samples, we utilized the set-valued approach for deriving versatile maximum likelihood function under both EPS and random sampling.
When deriving the models by considering the associated cost effectiveness, we consider three possible sample selection strategies: a random sampling selection and two EPS samples selection from the original cohort. One consideration for EPS samples is to draw the samples from the extreme tails of the health outcome value distribution of interest. The other is to draw the EPS samples from the extreme tails of the original primary outcome of interest in GWAS, which will become an endogenous risk factor for another clinical or health outcome of interest in another study. For example, in a benign ethnic neutropenia GWAS study, the samples were originally selected based on extreme values of white blood cell count (WBC) (dbGaP Study Accession: phs000507.v1.p1), which is a risk factor of many clinical or health outcomes such as coronary artery disease. Extensive simulations demonstrated that MREPS outperform the standard 2SLS approach under two EPS designs, while producing almost comparable results under the random sampling. The loss of information due to the estimation under the samples drawn from the original cohort is minimum under MREPS, regardless of whether samples are drawn either randomly or from extreme tails. This demonstrates that MREPS offers cost effective solutions without sacrificing desired statistical properties in terms of bias, standard errors, power, and coverage percentages of the confidence intervals (CIs). We then illustrate the use of the proposed MREPS in an example considering the causal effect of total fetal hemoglobin (HbF) on anemia in sickle cell anemia (SCA) patients enrolled in the Sickle Cell Clinical Research Intervention Program (SCCRIP) [21] and Baylor College of Medicine (BCM) as the discovery cohort [22] and using the Silent Cerebral Infarct Transfusion (SIT) Trial data [23] as the validation cohort. This is the first study to demonstrate a causal relationship between HbF (%) and total hemoglobin (Hb) concentration (g/dL) in SCA patients using MR analysis on observational data.
The rest of the article is arranged in the following order. Section 2 discusses the development of the statistical models, estimation process, and testing procedure for the causal effect of the risk factor. Section 3 describes and presents the results of simulation studies. Section 4 discusses the real data application with the results. Finally, a concluding discussion is given in Section 5.

2. Method

In this section, we discuss the development of the versatile statistical models for estimating and testing the causal effect of a risk factor on an outcome under the random sampling and EPS designs by integrating the MR framework.

2.1. Model

We consider a large cohort of N random individuals. The EPS and random samples of size n (<N) are drawn from this large cohort, which will be discussed under each design. Let Y i be the continuous outcome of interest and X i be the continuous risk factor for i = 1 , , N . Let Z i = ( 1 , Z i 1 , , Z i k ) T be the k + 1 -dimensional confounding vector for an integer k and G i be the genetic instrument variable, which takes values 0, 1, and 2, for the number of minor alleles of a single nucleotide polymorphism (SNP), or it can be any combined effect of multiple SNPs such as polygenic or allele scores. Further, we assume G i satisfies all valid instrument variable assumptions [24].
We utilize the concepts of the general MR framework and the set-valued approach [14] for constructing the versatile statistical model under EPS and random samples. We consider G i as a valid instrument variable. Let the outcome, risk factor and confounding factors be linearly associated. Assume the risk factor, genetic instrumental variable, and confounding factors are linearly associated. Then the versatile model is given by:
Y i = β 1 T Z i + β 2 X i + ϵ y X i = γ 1 T Z i + γ 2 G i + ϵ x S i = I ( X i X ) I ( Y i Y ) , where X = ( x 0 , x 1 ] [ x 2 , x 3 ) and Y = ( y 0 , y 1 ] [ y 2 , y 3 ) .
Here, β 1 T = ( β 10 , β 11 , , β 1 k ) , γ 1 T = ( γ 10 , γ 11 , , γ 1 k ) are regression parameters, including the intercepts β 10 , γ 10 ; β 2 is the effect of the risk factor on the outcome, which is also the parameter of interest; γ 2 is the effect of the genetic instrumental variable on the risk factor. Note that to determine if the genetic variant G i is a valid IV, we are interested in estimating and testing the significance of γ 2 , besides the primary interests of estimating and testing β 2 . The associated random error terms ϵ y and ϵ x of each model are assumed to be correlated and followed the bivariate normal density N ( 0 , Σ ) with mean 0 T = ( 0 , 0 ) and covariance matrix Σ = σ y y σ x y σ x y σ x x . The correlation of the error terms is given by ρ = σ x y / σ x x σ y y . The thresholds x 0 < x 1 x 2 < x 3 determine the extreme sample selection region of X i , and the thresholds of the extreme sample selection region for Y i are y 0 < y 1 y 2 < y 3 . Note that the first and second relations in (1) are the 2nd and 1st stage regressions in the general MR framework, respectively.
There is not any known real-world application of the EPS design based on the sample selection from the extreme tails of both X i and Y i values, simultaneously. Because of this limitation, we ignore this hypothetical design, regardless of the possibility of implementing the model (1). For EPS designs based only on either extreme tail of X i or Y i values and the random sampling design, the supports X and Y should be appropriately chosen for each model by choosing appropriate cutoff values x 0 , x 1 , x 2 , x 3 , y 0 , y 1 , y 2 , and y 3 . The form of the indicator S i also changes accordingly. In the following sections, we discuss the utilization of model (1) under each design by choosing the appropriate supports and the indicator S i .

2.1.1. MREPS for Samples from the Extreme Tails of Outcome Distribution

The cutoff values y 0 < y 1 < y 2 < y 3 , with y 1 y 2 , set the thresholds for sample selection under the EPS design based on the extreme outcome values. With this setting Y = ( y 0 , y 1 ) ( y 2 , y 3 ) . The restriction on the support X is relaxed by choosing x 0 = , x 3 = , and x 1 = x 2 so that X becomes ( , ) , which results I ( X i X ) = 1 . As a result, the indicator S i reduces to S i = I ( Y i Y ) . The n individuals whose outcome values fall into the extreme intervals ( y 0 , y 1 ) and ( y 2 , y 3 ) are chosen for the EPS design without imposing any restriction on the risk factor value X i . We denote this EPS design as EPSY throughout the paper.

2.1.2. MREPS for Samples from the Extreme Tails of Risk Factor Distribution

For the EPS design constructed upon the risk factor, say EPSX, let the n individuals whose risk factors fall into either ( x 0 , x 1 ) or ( x 2 , x 3 ) , with x 1 x 2 , are chosen as the EPS samples without imposing any restriction on the outcome values by choosing y 0 = , y 3 = , and y 1 = y 2 . Note that under this setting, X = ( x 0 , x 1 ) ( x 2 , x 3 ) , and S i reduces to S i = I ( X i X ) as Y = ( , ) .

2.1.3. MREPS for Random Samples

For random sampling, we relax all restrictions on both supports X and Y by choosing x 0 = y 0 = , x 3 = y 3 = , and x 1 = x 2 , y 1 = y 2 . With this setting, X = Y = ( , ) . As a result, S i becomes 1 for all i. Under this cutoff selection, the EPS samples turn into the whole cohort sample of N individuals as S i = 1 for all individuals. For the random subsampling design, n individuals can be randomly sampled from the cohort of N individuals. Therefore, the model (1) can be applied to analyze these n individuals. This transformation capability of the set-valued approach implies that the general MR model given by the first two relations of (1) is a special case of the MREPS model (1), indicating that MREPS can be applied to the general MR model, EPSX and EPSY. With this adaptive capability, MREPS turns out to be a versatile, universal approach for MR, regardless of if the samples are random or EPS.

2.2. Estimation and Hypothesis Testing

Notice that under the general MR model given by the first two relations of (1), the typical approach is to employ 2SLS regression to estimate the model parameters under the random sampling framework. However, this approach is not valid for the EPS designs under the extreme tails sampling given by the indicator S i as it violates the random sampling upon which the regression is based. To overcome this restriction, we propose a maximum likelihood approach by utilizing LIML, which is the counterpart maximum likelihood approach of 2SLS, and integrating the set-valued approach proposed by [14].

2.2.1. Samples Drawn from the Extreme Tails of Outcome

For the samples drawn from the extreme tails of the outcome and for given Z i , G i and S i , the likelihood function for the observations X i and Y i can be expressed as
l 1 = i = 1 n Pr ( Y i = y i , X i = x i | Z i , G i , S i = 1 ) ,
where
Pr ( Y i = y i , X i = x i | Z i , G i , S i = 1 ) = Pr ( Y i = y i , X i = x i , S i = 1 | Z i , G i ) Pr ( S i = 1 | Z i , G i )
Note that X = ( , ) and Y = ( y 0 , y 1 ) ( y 2 , y 3 ) for this sample selection, and, hence, S i = I ( Y i Y ) . If y i Y , then
Pr ( Y i = y i , X i = x i , S i = 1 | Z i , G i ) = 1 2 π σ x x σ y y ( 1 ρ 2 ) exp { 1 2 ( 1 ρ 2 ) [ y i β 1 T Z i β 2 x i σ y y 2 + x i γ 1 T Z i γ 2 G i σ x x 2 2 ρ y i β 1 T Z i β 2 x i σ y y x i γ 1 T Z i γ 2 G i σ x x ] } .
Otherwise, Pr ( Y i = y i , X i = x i , S i = 1 | Z i , G i ) = 0 . The marginal probability can be calculated by
Pr ( S i = 1 | Z i , G i ) = 1 2 π σ x x σ y y ( 1 ρ 2 ) Y exp { 1 2 ( 1 ρ 2 ) [ y β 1 T Z i β 2 x σ y y 2 + x γ 1 T Z i γ 2 G i σ x x 2 2 ρ y β 1 T Z i β 2 x σ y y x γ 1 T Z i γ 2 G i σ x x ] } d x d y .
Let θ 1 = γ 1 Z i + γ 2 G i , θ 2 = β 1 T Z i , σ 1 2 = ( 1 ρ 2 ) σ x x σ y y / σ 2 2 , and σ 2 2 = σ x x β 2 2 + σ y y + 2 ρ β 2 σ x x σ y y . Then the probability (5) can be reduced to
Pr ( S i = 1 | Z i , G i ) = F ( θ 2 + θ 1 β 2 , σ 2 ) ( y 1 ) F ( θ 2 + θ 1 β 2 , σ 2 ) ( y 0 ) + F ( θ 2 + θ 1 β 2 , σ 2 ) ( y 3 ) F ( θ 2 + θ 1 β 2 , σ 2 ) ( y 2 ) ,
where F ( θ 2 + θ 1 β 2 , σ 2 ) ( · ) is the normal cumulative distribution with mean θ 2 + θ 1 β 2 and standard deviation σ 2 . The derivation of this relation is given in Supplementary Materials Section S1.1. Using the relations (6), (4) and (3) along with (2), the likelihood can be derived as
l = i = 1 n Pr ( Y i = y i , X i = x i | Z i , G i , S i = 1 ) .
Closed form solutions for the optimum parameters do not exist for the likelihood (7) as it is a nonlinear function. By utilizing appropriate numerical optimization procedures such as quasi- Newton, we can obtain the maximum likelihood estimators numerically from (7). Note that the likelihood given by (4) is the bivariate normal density, which is also LIML. See [19] for more details about LIML.
Next, to investigate the validity of G i as an IV statistically, we test the significance of the parameter γ 2 by utilizing a Wald type test statistic γ ^ 2 2 / v a r ^ ( γ ^ 2 ) comparing H 0 : γ 2 = 0 vs. H 1 : γ 2 0 . Similarly, to test the significance of the effect of the risk factor, we propose the Wald type test statistic, β ^ 2 2 / v a r ^ ( β ^ 2 ) comparing H 0 : β 2 = 0 vs. H 1 : β 2 0 . Here, γ ^ 2 and β ^ 2 are the MLEs of γ 2 and β 2 obtained by maximizing (7), and v a r ^ ( γ ^ 2 ) and v a r ^ ( β ^ 2 ) are, respectively, the squared standard errors of γ ^ 2 and β ^ 2 , which can be obtained via the resultant Hessian matrix from the end of the numerical optimization process of (7). These test statistics follow a χ 2 distribution with one degree of freedom under the null hypotheses. Note that the conclusions of the above hypothesis tests should be made sequentially by first assessing the significance of γ 2 as a valid instrument effect. If γ 2 is not significant then the corresponding genetic variant is not a valid instrument, and, hence, testing the significance of β 2 meaningless.

2.2.2. Samples Drawn from the Extreme Tails of Risk Factor

Similar to the above section, for the samples drawn from the extreme tails of the risk factor, for given Z i , G i and S i , the likelihood function for the observations X i and Y i is still given by the relations (2) and (3). Recall that the supports X and Y are, respectively, given by ( x 0 , x 1 ) ( x 2 , x 3 ) and ( , ) along with S i = I ( X i X ) for this sample draw. The corresponding numerator and denominator of (3) can be evaluated as bellow.
If x i X , then
Pr ( Y i = y i , X i = x i , S i = 1 | Z i , G i ) = 1 2 π σ x x σ y y ( 1 ρ 2 ) exp { 1 2 ( 1 ρ 2 ) [ y i β 1 T Z i β 2 x i σ y y 2 + x i γ 1 T Z i γ 2 G i σ x x 2 2 ρ y i β 1 T Z i β 2 x i σ y y x i γ 1 T Z i γ 2 G i σ x x ] } .
Otherwise, Pr ( Y i = y i , X i = x i , S i = 1 | Z i , G i ) = 0 .
Pr ( S i = 1 | Z i , G i ) = 1 2 π σ x x σ y y ( 1 ρ 2 ) X exp { 1 2 ( 1 ρ 2 ) [ y β 1 T Z i β 2 x σ y y 2 + x γ 1 T Z i γ 2 G i σ x x 2 2 ρ y β 1 T Z i β 2 x σ y y x γ 1 T Z i γ 2 G i σ x x ] } d x d y .
By following a similar approach as above section, it can be shown that
Pr ( S i = 1 | Z i , G i ) = F ( θ 1 , σ x x ) ( x 1 ) F ( θ 1 , σ x x ) ( x 0 ) + F ( θ 1 , σ x x ) ( x 3 ) F ( θ 1 , σ x x ) ( x 2 ) ,
where F ( θ 1 , σ x x ) ( · ) is normal cumulative distribution with mean θ 1 and standard deviation σ x x . Using the relations (8), (10) and (3) along with (2), the likelihood can be derived as
l = i = 1 n Pr ( Y i = y i , X i = x i | Z i , G i , S i = 1 ) .
By utilizing an appropriate numerical optimization procedure, MLEs can be obtained from (11).
Using the same Wald test statistic in the above section, hypothesis testing can be executed for testing the significance of the effects of the genetic variant and the risk factor, sequentially.

2.2.3. Random Samples

Note that, for a random sample, we choose x 1 = x 2 and y 1 = y 2 by setting x 0 = y 0 = and x 3 = y 3 = . Therefore, the indicator S i and the selection probability Pr ( S i = 1 | Z i , G i ) in the denominator of (3) becomes 1 for all i. Hence, the likelihood function reduces to l = i = 1 n Pr ( Y i = y i , X i = x i | Z i , G i ) , where
Pr ( Y i = y i , X i = x i | Z i , G i ) = 1 2 π σ x x σ y y ( 1 ρ 2 ) exp { 1 2 ( 1 ρ 2 ) [ y i β 1 T Z i β 2 x i σ y y 2 + x i γ 1 T Z i γ 2 G i σ x x 2 2 ρ y i β 1 T Z i β 2 x i σ y y x i γ 1 T Z i γ 2 G i σ x x ] } ,
transforming to the LIML, which is the equivalent maximum likelihood approach for 2SLS, under random sample. Using this relationship, the final log likelihood can be calculated, and estimation and the hypothesis testing can be carried out as mentioned above two sections.

3. Simulations

To investigate the proposed MREPS methods in estimation and hypothesis testing, thorough simulations were carried out by comparison with 2SLS procedure. We utilized the widely used t s l s ( ) function, which fits 1st and 2nd stage regression models under the standard 2SLS procedure in estimating model parameters, in the “sem” R package [25] for 2SLS approach. To obtain MLEs from the corresponding likelihood functions, general non-linear optimization procedure implemented by utilizing the R function solnp() in “Rsolnp” package [26,27]. At the end of the optimization process, solnp() outputs the Hessian matrix evaluated via the resulting MLEs. We used this resulting Hessian matrix for evaluating the standard errors (SEs) of the MLEs. This section describes and summarizes the simulation studies along with the findings.
For each setting, 1000 replicates were considered. We first simulated a larger cohort of N individuals. The subsamples of size n were drawn randomly and from the extreme tails of outcome and risk factor to represent the application of each model discussed in above section under the random sampling, EPSY and EPSX. For the EPS designs, the n subjects were drawn from the upper and lower q quantile from the distribution of primary trait values Y i for EPSY and risk factor values X i for EPSX. For EPSY, the support X was set to be ( , ) and for EPSX, the support Y was set to be ( , ) . For these two designs, n satisfies the relation n = 2 q N , where N is the original cohort size. This selection sets the cutoffs x 1 , x 2 and y 1 , y 2 under the the corresponding EPSX and EPSY, accordingly. For EPSX, the values x 0 , x 3 were chosen to be - and , respectively. The cutoffs y 0 , y 3 were, respectively, chosen to be - and for EPSY. We first considered q = 0.01 for assessing the different effect sizes of β 2 . We then considered a set of q ranging from 0.005 to 0.045 with increment of 0.01 for assessing the effect of different n given N = 20 , 000 on the performance of MREPS and 2SLS.
We considered two scenarios in terms of the instrument variables in our simulations. We first considered a single SNP as an IV. The genotype of the SNP with a minor allele frequency (MAF) of 0.3 was generated under the Hardy–Weinberg equilibrium (HWE) assumption. In this setting, the term γ 2 G i is used directly in the regression of the risk factor X i on the outcome Y i in the model (1) by setting γ 2 = 0.5 .
We then considered multiple SNPs each with a MAF of 0.3 contributing to the risk factor. For this multiple SNP setting, we first simulated the genotypes of multiple SNPs independently and then simulated the risk factor according to four situations below to mimic different practical situations, which is similar to scenarios considered by [28]:
(a)
Equal-sized combined effects of 9 and 25 independent SNPs: two separate cases were considered by using different numbers, 9 and 25, of SNPs. The genotype data of each SNP was independently generated under HWE as mentioned above. Then, the equal sized combined effect was considered by using γ 2 G i = γ 2 j = 1 9 G i j , where γ 2 = 0.17 and j = 1 9 G i j is the summation of the 9 independent SNP genotype values for the ith individual. Similarly, for the 25 SNPs, γ 2 G i = γ 2 j = 1 25 G i j , where γ 2 = 0.07 and j = 1 25 G i j is the summation of the 25 independent SNP genotype values for the ith individual. This case represents the utilization of the summary of multiple SNPs with equal-sized association with the risk factor as an IV.
(b)
Different-sized combined effects of 9 and 25 SNPs: with this setting, γ 2 G i turned out to be j = 1 J γ 2 j G i j for J = 9 , 25 . Each independently generated SNP under HWE was multiplied by randomly generated γ 2 j from N ( 0.18 , 0.018 ) for 9 SNPs and from N ( 0.06 , 0.018 ) for 25 SNPs separately. Then, the summation of these products considered as γ 2 G i under 9 and 25 SNPs separately for each i. This scenario represents the utilization of the summary of multiple SNPs with different-sized association with the risk factor as an IV.
(c)
A few large and many small effects of 9 and 25 SNPs: two SNPs generated under H-WE were multiplied by 0.46 and the rest SNPs multiplied by 0.092 separately for 9 and 25 SNPs. Next, for each i, the summations of these products separately considered as γ 2 G i for 9 and 25 SNPs. This represents the utilization of the summary of multiple SNPs with very few large associations and a large number of week associations with the risk factor as an IV.
(d)
A combination of valid and invalid SNPs as an IV, 9 and 25 SNPs: Among 9 SNPs g-enerated under HWE, 4 SNPs were multiplied by 0.37 while multiplying the rest by 0, and the summation of the resulting values was considered as γ 2 G i for the ith individual. Similarly, among 25 SNPs, 12 SNPs were multiplied by 0.14 while multiplying the rest by 0, and the resulting summation was considered as γ 2 G i for the ith individual. This case represents the utilization of the summary of few valid and many invalid SNPs as an IV.
We also considered two confounding covariates ( k = 2 ). The components Z i 1 , Z i 2 of the confounding vector Z i = ( 1 , Z i 1 , Z i 2 ) were generated using Z i 1 N ( 0 , 1 ) and Z i 2 B e r n o u l l i ( p = 0.5 ) for the ith individual. By choosing σ x x = 4 , σ y y = 2 and σ x y = 0.8 for the covariance matrix Σ , the correlated error terms ϵ x and ϵ y were generated from bivariate normal density with the mean vector 0 T = ( 0 , 0 ) and the covariance matrix Σ . By using the generated Z i , G i and ϵ x in the second relation of (1), the endogenous risk factor X i was generated first, and, using this X i , Z i and ϵ y , the outcome Y i was generated using the first relation of (1). After the data were generated, we defined an unweighted polygenic score (PGS) as the IV in multiple SNP settings for MR data analysis because (1) naïve use of the data under analysis to choose SNPs to define PGS or for deriving weights leads to substantial bias; (2) weighted PGS increases the power, but the increase is small when SNPs have similar effect sizes [28].

3.1. Comparison: Bias, Rejection Proportion, Standard Error, and Coverage Percentage

Our major interest was to estimate and test the significance of the effect of the risk factor on the outcome, which can be achieved by estimating and testing β 2 via the proposed estimation and testing processes. To investigate these procedures in estimation and testing, first we assessed the bias and rejection proportion of the null hypothesis using the single SNP setting described above for G i . In achieving this, we set β 1 T = ( 0.5 , 0.4 , 0.7 ) and γ 1 T = ( 0.3 , 0.4 , 0.5 ) . To investigate the behavior of bias and rejection proportion due to the change of the effect size of β 2 , the sequence of points from 0 to 0.6 with an increment of 0.1 was considered for β 2 . Figure 1 displays the plots of β 2 vs. β ^ 2 and the rejection proportion under the random sampling, EPSX and EPSY. Under the random sampling, both MREPS and 2SLS produced almost the same results. The β 2 vs. β ^ 2 curves overlapped with the true β 2 vs. β 2 curve. The rejection proportion plots of MREPS and 2SLS overlapped with each other, becoming comparable approaches under the random sampling, by producing almost same unbiased estimates and the rejection proportions. However, under EPSY, 2SLS became a highly biased approach, and the bias gradually increased as β 2 increases. MREPS still overlapped with the true curve, becoming an unbiased estimator. Although the rejection proportion curves overlap with each other, the estimate of the true causal effect by 2SLS was uninterpretable due to highly biased estimator. Under EPSX, MREPS was still the unbiased estimator, overlapping with the true curve. 2SLS consistently maintained the bias by consistently overestimating the true β 2 . 2SLS produced unreliable rejection proportion, which is almost always 1, even when β 2 = 0 . Therefore, under EPSX and EPSY, we cannot use 2SLS for MR analysis because of biased and uninterpretable parameter estimates and uncontrolled type I error rate.
Under the multiple-SNPs settings (a)(d), we accessed MREPS by comparison with 2SLS quantitatively in terms of average bias, average SE, bias percentages, rejection proportion of the null hypothesis and the coverage percentage of the 95% CI under 9 and 25 SNPs, separately. We kept the true values fixed at the same vectors above and considered the values { 0 , 0.2 , 0.4 } for β 2 . Table 1 summarizes the results under EPSY. The corresponding results for EPSX and random sampling are given in Supplemental Table S1. From Table 1, we observed that MREPS controlled the bias percentages well below 10% for all none zero β 2 settings, while 2SLS failed to control the bias percentages. The resulting bias percentages were above 100% for all nonzero β 2 settings, becoming a highly biased approach. MREPS produced the smallest average SEs in all settings. The coverage percentages of 95% CIs of MREPS were all close to 95 for all settings. However, the coverage percentages of 2SLS CI became close to 95 only when β 2 = 0 . For all nonzero β 2 coverage percentages decreased significantly as β 2 increased under each setting. The only acceptable statistics under 2SLS was the rejection rate of the null hypothesis. However, reliability of these rejection rates was questionable due to the highly biased estimates and the relatively large SEs. From Supplement Table S1, under the random sampling, comparable results were observed from both 2SLS and MREPS. However, for EPSX, the worst results were observed from 2SLS, while observing overall acceptable better results from MREPS. High bias percentages, poor coverage percentages (below 2.2 for all cases) and unreliable (almost 100% rejection rates for the null hypothesis, regardless of the change of the magnitude of the effect size) were observed from 2SLS. These overall observations are consistent with the observations displayed in Figure 1.
Overall, based on Table 1, Supplement Table S1 and Figure 1, we can conclude that both MREPS and 2SLS were comparable unbiased procedures under the random sampling. However, MREPS became the only reliable approach under both EPSX and EPSY, becoming a versatile approach that can be utilized under the random as well as EPS samples in MR analysis.

3.2. Comparison: The Optimum Cost-Effective Subsample Size

One strong and important implication of the above results is that the possibility of reducing the genotyping or sequencing cost by drawing a representative subsample from the original cohort and then executing genotyping or sequencing on the selected subsample. The obvious trade off consequences of such subsample selection is the inevitable sacrifices of the desired qualities of the estimates and test statistics such as increment of bias and reduction of statistical power. However, the above observations imply that MREPS is the most appropriate solution which can optimize such trade off consequences favorably under any of EPSX, EPSY or random subsample drawing. To explore this capability further numerically, we carried out simulations to explore the optimal subsample size n that required to chosen under EPSX, EPSY and random sampling. We considered a cohort of size N = 20,000. For parameter β 2 , the values 0.1 annd 0.4 were considered. For subsample size n the sequence of values from 200 to 2000 were considered with the increment of 200, which corresponds to quantile, q, ranging from 0.005 to 0.045 with the increment of 0.01 under EPSX and EPSY. For IV, a single SNP was considered.
Figure 2 displays the behavior of the estimator β ^ 2 due to the change of the subsample size n or due to the change of the corresponding quantile value q under EPSX and EPSY. Under the considered simulation setting, for the random subsample, n = 600 is an ideal sample size, and there is not much gain for other large sample sizes, indicating the consistency property is achieved. Both 2SLS and MREPS procedures produced roughly the same estimators. For both EPSX and EPSY, the ideal optimal sample size can be as small as 200 (100 in both tails) under MREPS, and there is not any substantial gain for the larger sample sizes. However, regardless of the increment of the sample size n, 2SLS consistently produced nearly the same bias, indicating that 2SLS fails to recover the tradeoff loss due to the subsample selection under EPSX and EPSY.
From these figures we can conclude that there are almost no losses with the valid statistical approach of MREPS due to the subsample selection, in estimation of β 2 . This demonstrates the cost effectiveness of MREPS in reducing the genotyping or sequencing cost without sacrificing desired statistical properties of the estimator.
An important implication of this simulation setting is that it can be treated as a pilot simulation study to determine the required EPS sample size for obtaining unbiased causal effect in advance before conducting the real study. By setting a range for the sample size based on the available resources, simulations can be executed on several sample values chosen from the range to determine the optimal size. This study would be a foundation and it can be adapted based on the objectives and the type of data, including prospective, retrospective, exiting, or newly acquiring data under the EPS design.

4. Unravelling the Effect of HbF on Anemia in Sickle Cell Anemia

To examine the performance of the proposed MREPS approach in real data analysis, we applied the MREPS and 2SLS approaches to assess whether there is a causal effect of HbF on Hb as described below.
Background: Sickle cell anemia (SCA; HbSS and S β 0 -thalathemia) is a catastrophic hemoglo-binopathy caused by a single gene mutation located on chromosome 11 that is defined by chronic hemolytic anemia, recurrent microvascular occlusion, tissue ischemia, and inflammation [29]. Acute SCA-related events (i.e., vaso-occlusive pain, acute chest syndrome, and splenic sequestration) initially manifest during infancy, and young children often develop subclinical end-organ damage [30,31,32,33,34,35,36,37,38,39] that frequently progresses and contributes to morbidity and early mortality for adults with SCA. The degree of anemia (Hb concentration) in individuals with SCA is a known risk factor for disease severity, and the overall percentage of fetal hemoglobin (HbF) produced by an individual instead of sickle hemoglobin (HbS) is a strong modifier of disease severity, including low HbF being an independent risk factor for early mortality [40]. GWAS and basic science experiments have established that HbF is heritable and modulated by three known genes of B C L 11 A , H B S 1 L M Y B and the extended β -globin locus (in total 11 SNPs), which account for over 30% of variance observed in endogenous HbF levels in individuals with SCA, and polymorphisms in these genes are associated with higher Hb concentrations [22,41,42,43,44]. Hydroxyurea is an FDA approved therapy for SCA that largely works through HbF induction [45], with induction of HbF levels of over 20% being associated with marked clinical benefits and higher overall Hb concentrations [46], and induction of HbF is required for improvements in Hb concentrations in murine models [47]. However, no study has shown a causal effect of HbF on Hb in SCA patients.
Data preparation and analysis: 585 SCA participants of the Sickle Cell Clinical Research Intervention Program (SCCRIP) study and Baylor College of Medicine (BCM), aged ≥2 years, were used for genetic associations with cross-sectional baseline HbF [22] and were analyzed in this study. The design of SCCRIP has been described previously [22]. There were no participants on hydroxyurea therapy for at least 90 days prior to HbF measurements. Ninety-four participants had received a blood transfusion within the past 90 days. Whole genome sequencing were performed using Illumina HiSeq X10 sequencers at the HudsonAlpha Institute for Biotechnology Genomic Services Laboratory (Huntsville, AL). We used 11 HbF SNPs based on results from [22]. We defined polygenic score of HbF (PGSHbF) as the summation of the number of high HbF alleles across these 11 SNPs, which is opposite to the PGSHbF defined by [22] that equals to the summation of the number of low HbF alleles. The first five principal components (PCs) were calculated in the original cohort [22]. The covariates we used here were the same as those used in [22], but we further added blood transfusion history as an additional covariate because it was associated with Hb and HbF [48].
To validate the causal effect of HbF on Hb based on PGSHbF and each individual SNP identified in the discovery cohort of SCCRIP/BCM, we analyzed the 724 patients in SIT Trial. The cohort description and GWAS data have been reported by [23]. The missing genotypes for this study were imputed using the Michigan Imputation Server for chromosomes 2, 6, and 11 using the 1000 genomes Phase 3 data as African reference population samples [49]. After imputation, one SNP ( r s 66650371 ) from H B S 1 L M Y B remained unavailable. The dose data for the remaining ten SNPs were extracted and analyzed as continuous variables. PGSHbF was calculated as the unweighted sum of the dosage for the high HbF from ten SNPs using dosage data and was analyzed as a continuous variable. The first five principal components (PCs) were calculated in the original cohort [23] and included in the model to control for potential population stratification. The covariates for which we adjusted here are age, sex, diagnosis (HbSS or S β 0 ), and 5 PCs.
To show the performance of the MREPS approach on data collected under EPSY and EPSX, we then combined the two cohorts to increase the cohort size and mimicked the two EPS designs. For choosing the two extreme tail regions for EPS sample selection under EPSY, we carried out a pilot simulation by using the simulation setting of Figure 2 as a foundation. The size of the combined cohort of SIT and SCCRIP is 1309, which was considered as the size of the original cohort. The percentiles 10%, 20%, 30%, 40%, and 50% were chosen as the candidate cutoff values for the extreme regions. The corresponding sample sizes determined by these cutoffs are 261, 523, 785, 1047, and 1309, respectively. Note that 50% percentile represents the entire cohort of size 1309. The parameters β 2 and γ 2 were set at 0.11 and 1.4 based on the literature [50] and this study. MAF was set as 0.3. The corresponding results are given by Figure S3 and Table S3 of Supplementary Materials. Based on these results, the 20% percentile for the extreme tail sample selection gave a simulated 80% power to detect a causal effect of HbF on Hb in this study at a significance level of 0.05. Without financial budget information, we chose 20% percentile only based on the power. In the real practice, the extreme percentile might be determined based on the available resources and the statistical power. We analyzed 492 participants with Hb in the lowest and highest 20% percentile, which was used to mimic EPSY. We also analyzed 512 participants with HbF in the lowest and highest 20% percentile, which was used to mimic EPSX. For comparison, we also randomly sampled 518 individuals for these MR analyses. Because in SIT, PGSHbF was defined based on 10 SNPs, we then removed SNP r s 6650371 from PGSHbF in SCCRIP/BCM. In the combined data analysis, we adjusted for the common covariates described above with additional covariate cohort (SCCRIP/BCM or SIT). These combined cohort and mimicked EPS analyses were only used to demonstrate the performance of the proposed new MREPS approach.
Pleiotropy: A common problem in MR analyses is the presence of horizontal pleiotropic effects, i.e., a genetic variant has associations with more than one risk factor on different causal pathways [3]. As these three genes are the genetic determinants of HbF, and higher HbF is necessary to lead to higher Hb, it is very unlikely that these three genes affect Hb via other pathways. However, to test the violations of the IV assumptions, we used three approaches. First, we checked if the three genes related to HbF are also significantly associated with the other risk factors of anemia. Second, we used the MR-Egger method to test for pleiotropy [51]. This performs a regression of the SNP-Hb associations on the SNP-HbF associations with the intercept left unconstrained to test for the evidence of bias-generating pleiotropy. Finally, we used the Sargan test for overidentification for pleiotropy [52]. The null hypothesis in this test is that all exogenous instruments are, in fact, exogenous and uncorrelated with the model residuals, and all over-identifying restrictions are, therefore, valid.
Invalid instruments: Here, we only considered three known HbF modifiers. The F statistic from the first stage linear regression in 2SLS or the statistical test for γ 2 was used to evaluate the strength of the genetic instrumental variable. Therefore, we did not analyze r s 3834466 as an individual SNP IV because it was not significantly associated with HbF in both cohorts. We also did not include the other GWAS hits of HbF because they are not validated in independent studies or in biological functional studies.
SCCRIP/BCM Results: The PGSHbF was not associated with any potential confounding variables, including age, sex, blood transfusion history, site, SS/S β 0 , and 5 PCs based on the Spearman correlation analysis and the two-sample Wilcoxon rank sum test. In SCCRIP/BCM, the Sargan test provided evidence of pleiotropic effects in HbF ( p = 0.0005 ), which was not supported by the MR Egger intercept test ( p = 0.36 ). In contrast, in SIT, both Sargan test ( p = 0.15 ) and MR Egger intercept test ( p = 0.72 ) did not provide any evidence of pleiotropic effects in HbF.
A strong association between HbF and Hb was observed after adjusting for covariates (est [estimate] = 0.09, SE = 0.007, p < 0.0001 ) in Supplementary Figure S1. PGSHbF was also associated with HbF (est = 1.64, SE = 0.1, p < 0.0001 ) and Hb (est = 0.17, SE = 0.02, p < 0.0001 ) (Supplementary Figure S1). After including both HbF and PGSHbF in the same model, PGSHbF was not independently significantly associated with Hb (est = 0.04, SE = 0.03, p = 0.13 ), although HbF was still independently significantly associated with Hb (est = 0.08, SE = 0.008, p < 0.0001 ). This implies that PGSHbF does not have a direct effect on Hb except its effect on Hb via HbF, which supports the third MR analysis assumption [53].
Using the entire group (Figure 3, Table 2), MREPS analysis using PGSHbF as the IV showed that there was a significant relationship between HbF and Hb (est = 0.10, SE = 0.013, p = 1 × 10 16 ). All 10 SNPs were significantly associated with HbF. Using each of 10 individual SNP as an IV in the model, there was a significant relationship between HbF and Hb (min p = 6.7 × 10 10 for B C L 11 A SNP r s 1427407 and max p = 0.029 for H B B SNP r s 28440105 ) at a level of 0.05 (Table 2, Supplemental Table S2). The strongest effect for testing the causal effect of HbF on Hb induced by the B C L 11 A SNP r s 1427407 is because this SNP explains more than 20% of the variance of HbF, which is almost 2/3 of the variance explained by PGSHbF that includes the contributions from all 11 SNPs from three genes [22,54]. In contrast, 2SLS only rejected PGSHbF and 9 out of 10 analyzed SNPs (90%). p-values for each test obtained by 2SLS were larger than those obtained by MREPS. Similarly, even though Benjamini–Hochberg approach [55] was used to control false discovery rate (FDR) to adjust for multiple comparisons, MREPS still rejected all 11 tests (PGSHbF + 10 SNPs) at FDR controlled at 0.05. However, 2SLS only rejected 9 (82%) of them: r s 968857 with a raw p value of 0.04 was not significant after multiple comparison correction. This was consistent with our previous simulation result.
SIT validation findings: To validate the findings from the entire cohort analysis from SCCRIP/BCM, we analyzed SIT data as an entire group. We first confirmed the relationships among HbF, HB and PGSHbF (see Supplementary Figure S1). There were no relations between PGSHbF and any potential confounding factor, including age, sex, SS/S β 0 , and 5 PCs based on the Spearman correlation analysis and the two-sample Wilcoxon rank sum test. When we included both HbF and PGSHbF in the same model, both PGSHbF and HbF were significantly associated with Hb although the variance in Hb explained by HbF was 3.5 times higher than that by PGSHbF (14% vs. 4%).
MREPS analysis validated that there was a significant relationship between HbF and Hb at a level of 0.05 using PGSHbF or 6 SNPs as the IV (Figure 3, Table 2). In contrast, 2SLS only showed a significant relationship between HbF and Hb when PGSHbF and 5 SNPs used as IVs at a level of 0.05 (Table 2, Supplemental Table S2). With FDR controlled at 0.05, the same conclusions held. Using SIT cohort, we replicated the finding that there is a causal effect of HbF on Hb predicted from genetic variants from these three genes. This is the first time that the causal effect of HbF on Hb was shown in SCA patients using the MR approach.
Combined SCCRIP/BCM and SIT and EPS designs: After combing SCCRIP/BCM with SIT, MREPS analysis showed that there was a significant relationship between HbF and Hb at a level of 0.05 using PGSHbF or 8 SNPs (90%) as the IV (Figure 4, Table 2, Supplemental Table S2). However, 2SLS missed two SNPs. The analysis results under the mimicked EPS and random sampling design are displayed in Figure 4, Supplemental Table S2 and Figure S2. Using the samples selected under EPSY (Figure 4), MREPS identified 8 out of 10 tests (80%), while 2SLS identified 6 (60%) at a level of 0.05. Similarly, using the samples selected under EPSX (Supplemental Figure S2), both MREPS and 2SLS only identified the relationship via IVs of PGSHbF and two B C L 11 A SNPs ( r s 1427407 and r s 7606173 ) at a level of 0.05. In contrast, using the randomly selected samples (Supplemental Figure S2), MREPS identified 7 out of 10 (70%) tests with correct effect direction at a level of 0.05, but 2SLS only identified 6 (60%). With FDR controlled at 0.05, the same conclusions held.

5. Discussion

Due to the capability of making causal inferences in observational studies, MR is an increasingly used approach in genetic epidemiology. The EPS design is another rapidly surging approach in GWAS studies as it is efficient, cost effective and bearing higher statistical power in rare genetic variants detection in disease etiology. In this study, we developed a maximum likelihood approach (MREPS) by bridging these two studies while preserving the benefits these two approaches offer. We compared MREPS with the standard 2SLS approach under the two EPS and random sampling designs. Although both methods provided comparable unbiased results under the random sampling design, 2SLS failed to produce unbiased estimates for the causal effect under the two EPS designs. As a result, it produced lowest coverage CIs under the EPS designs. In contrast, MREPS preserved these statistics under the EPS designs. MREPS pushed the capabilities of EPS design another step further, as it could produce unbiased estimates under the nonrandom EPS sample selection. The loss of statistical properties (such as bias, SE, and power) of the estimators and test statistics of MREPS is minimum under the random and EPS sample selection. With this capacity, MREPS preserves the cost benefit and the required statistical properties of the estimates simultaneously, regardless of whether the samples are EPS or random. This makes the MREPS a versatile and reliable approach that can be employed under the random and nonrandom extreme tails sampling designs. MREPS satisfies the necessity of having a versatile approach for exploring the causal effect of risk factors under the observational studies due to the rise of nonrandom sample selection designs such as the EPS designs which are commonly used in GWAS/NGS.
In an MR analysis for HbF on anemia risk using MREPS, we found that the higher predicted HbF calculated based on PGSHbF or some individual SNPs protects patients from anemia using the observation SCCRIP/BCM cohort data, which was validated in another independent SIT study. This is the first study in the literature to show the causal effect of HbF on Hb in SCA patients using MR analysis of observation data. In the data analyses using the combined SCCRIP/BCM and the SIT data, the corresponding EPSY, EPSX and randomly sampling from the combined cohort demonstrated the same conclusions as those obtained from the simulation studies. An important implication of the mimicked EPSY, EPSX, and random sampling compared to whole cohort analysis is that only a 40% of samples was used from the large combined cohort for these designs to get close conclusions to the conclusions derived from the combined large cohort. A 60% percent of the data from the large cohort was not used, which means that about 60% of associated cost, time, and resources under the utilization of EPS was saved compared to collecting samples/data on the whole samples, no matter EPS design is for prospective or retrospective studies.
In SCA patients, high HbF is strongly correlated with a low rate of acute vaso-occlusive pain, acute chest syndrome, fewer leg ulcers, and longevity [56]. There is an association of HbF with priapism, renal functional impairment, cerebrovascular disease, and, perhaps, sickle vasculopathy as estimated by tricuspid regurgitant velocity, although there is less conclusive evidence [56]. However, showing potential causal effect of HbF on these clinical outcomes in SCA patients is strongly prohibited because of the limitations of traditional analysis in observational studies. Therefore, MR analysis, which uses PGSHbF as an IV, allows us to examine whether there is potential causal effect of HbF on these SCA-related clinical complications using observation data. Our work demonstrating the causal effect of HbF on anemia provides a theoretical foundation for assessing the causal effect of HbF on the other clinical complications listed above because of the strong biological evidence showing the causal effect of three genes of B C L 11 A , H B S 1 L M Y B and the extended β -globin locus on HbF. We will investigate these research topics in the future.
Another advantage that MREPS offers is that it can estimate all model parameters in the proposed statistical models in Section 2. Because of this advantage, MREPS can be expanded to investigate the validity of the MR assumptions: (a) the association between the risk factor and the IV, and (b) pleiotropy effect. The proposed approach can already address the assumption (a). By utilizing the models (1) and (2) with unobservable confound U used by [51], by utilizing EM algorithm under the MREPS framework, these objectives can be achieved. The possibility of this development will be considered in future.
Although MREPS was proposed for MR analysis using genetic variants as IVs, it can also be readily applied to analyze any epidemiology or clinical or population studies in which data on the outcome or risk factor variable were collected under the random sampling or EPS designs because we can replace genetic variables with the non-genetic IV variables. This will expand the original 2SLS proposed in econometrics [17] to study designs other than random sampling design.
The MREPS framework can also be extended to several future developments. Here, we only considered continuous outcome and continuous risk factors. We can extend it to binary outcomes and/or binary risk factors by applying set-valued framework. In some public health studies, matched case-control design can be applied. By replacing the current likelihood function with clustered likelihood function, we can also extend MREPS to handle the causal inference with genetic data collected under matched case-control design. The MREPS framework provides a theoretical foundation for all these future considerations.
The main objective of these methodology developments is to explore the causation from the data collected under the EPS design because of the lack of valid approaches. Identifying valid SNPs for IVs is a common challenge for all MR approaches including the proposed models. The valid results are not guaranteed from the models under the violations of the IV assumptions. We recommend to use the models with carefully chosen IVs. The pleiotropy effect under the violation of IV assumptions can be tested by MR-Egger and Sargan tests before utilizing the models. When applying the models for sickle cell disease data, in this study, we relied on the existing literature evidence of genetic associations for choosing SNPs as valid IVs. Associating such literature evidence for genetic associations is highly recommend for obtaining valid results from MREPS. MREPS faces the same challenges as 2SLS when exploring causation under the lack of literature evidence of genetic associations.
In conclusion, we have introduced a versatile and efficient method for MR that is comparable to 2SLS under a random sampling design but outperforms 2SLS under EPS designs. We, for the first time, have demonstrated the causal effect of lower endogenous HbF on increased anemia risk in SCA patients. A user-friendly shiny app for MREPS was implemented and is readily available for academic researchers to explore MR analysis.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math10203743/s1.

Author Contributions

Conceptualization, J.S.S.L., J.H.E., S.B., Y.L., Y.C. and G.K.; methodology, J.S.S.L. and G.K.; software, J.S.S.L. and G.K.; validation, J.S.S.L. and G.K.; formal analysis, J.S.S.L. and G.K.; investigation, J.S.S.L., G.K., J.S.H., J.H.E., V.A.S. and M.R.D.; resources, J.H.E., V.A.S., J.S.H., C.M.T., M.R.D. and M.M.; data curation, J.H.E., V.A.S., M.R.D. and J.S.H.; writing—original draft preparation, J.S.S.L., J.H.E. and G.K.; writing—review and editing, J.S.S.L., J.H.E., K.S., S.R.R., V.A.S., J.S.H., M.R.D., Y.L., Y.C., M.M., S.B., C.M.T. and G.K.; visualization, J.S.S.L. and G.K.; supervision, G.K.; project administration, G.K.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was institutionally funded by American Lebanese Syrian Associated Charities (ALSAC) at St. Jude Children’s Research Hospital. The APC was funded by St. Jude Children’s Research Hospital.

Informed Consent Statement

Not applicable.

Data Availability Statement

WGS data for 585 SCCRIP/BCM individuals are available via St. Jude Cloud (https://platform.stjude.cloud/data/cohorts, Accession number: SJC-DS-1006) upon request and subsequent approval by the sickle cell genomic project steering committee.

Code Availability

R codes for MREPS are implemented in the Shiny web application (https://statphyskang.shinyapps.io/statphyskang/) (accessed on 8 October 2022).

Acknowledgments

This research is supported by the American Lebanese and Syrian Associated Charities (ALSAC). We acknowledge the High-Performance Computing Facility (HPCF) at SJCRH for providing shared HPC resources that have contributed to the research results reported within this article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MRMendalian randomization
2SLSTwo-stage least-squares
GWASGenome-wide association study
EPSExtreme phenotype sequencing
MREPSMR analysis under extreme or random phenotype sampling
IVInstrumental variables
NGSNext generation sequencing
WBCWhite blood cell count
CIConfidence intervals
HbFTotal fetal hemoglobin
HbHemoglobin
SCASickle cell anemia
SCCRIPSickle cell clinical research intervention program
BCMBaylor college of medicine
SITSilent Cerebral Infarct Transfusion
LIMLLimited information maximum likelihood
MLEMaximum likelihood estimator
SEStandard errors
EPS X EPS design constructed upon the risk factor
EPS Y EPS design constructed upon the outcome
HWEHardy–Weinberg equilibrium
SNPSingle nucleotide polymorphisms
HbSSickle hemoglobin
FDRFalse discovery rate
CPCoverage percentage
PGS HbF Polygenic Score of HbF

References

  1. Davey Smith, G.; Ebrahim, S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease?*. Int. J. Epidemiol. 2003, 32, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Davey Smith, G.; Hemani, G. Mendelian randomization: Genetic anchors for causal inference in epidemiological studies. Hum. Mol. Genet. 2014, 23, R89–R98. [Google Scholar] [CrossRef] [Green Version]
  3. Lawlor, D.A.; Harbord, R.M.; Sterne, J.A.C.; Timpson, N.; Davey Smith, G. Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Stat. Med. 2008, 27, 1133–1163. [Google Scholar] [CrossRef]
  4. Lamina, C. Digging into the extremes: A useful approach for the analysis of rare variants with continuous traits? BMC Proc. 2011, 5, S105. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Dorr, C.; Wu, B.; Remmel, R.; Muthusamy, A.; Schladt, D.; Abrahante, J.E.; Guan, W.; Mannon, R.; Matas, A.; Oetting, W.; et al. Identification of genetic variants associated with tacrolimus metabolism in kidney transplant recipients by extreme phenotype sampling and next generation sequencing. Pharm. J. 2019, 19, 375–389. [Google Scholar] [CrossRef] [PubMed]
  6. Kleinstein, S.E.; Rein, M.; Abdelmalek, M.F.; Guy, C.D.; Goldstein, D.B.; Diehl, A.M.; Moylan, C.A. Whole-Exome Sequencing Study of Extreme Phenotypes of NAFLD. Hepatol. Commun. 2018, 2, 1021–1029. [Google Scholar] [CrossRef] [PubMed]
  7. Peloso, G.; Rader, D.; Gabriel, S.; Kathiresan, S.; Daly, M.J.; Neale, B.M. Phenotypic extremes in rare variant study designs. Eur. J. Hum. Genet. 2016, 24, 924–930. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Amanat, S.; Requena, T.; Lopez-Escamez, J.A. A Systematic Review of Extreme Phenotype Strategies to Search for Rare Variants in Genetic Studies of Complex Disorders. Genes 2020, 11, 987. [Google Scholar] [CrossRef] [PubMed]
  9. Amanat, S.; Gallego-Martinez, A.; Sollini, J.; Perez-Carpena, P.; Espinosa-Sánchez, J.; Aran, I.; Soto-Varela, A.; Batuecas-caletrio, A.; Canlon, B.; May, P.; et al. Burden of Rare Variants in Synaptic Genes in Patients with Severe Tinnitus: An Exome Based Extreme Phenotype Study. EBioMedicine 2020, 66, 103309. [Google Scholar] [CrossRef]
  10. Perez-Gracia, J.L.; Pita, G.; Segura, V.; Pajares, M.J.; Fusco, J.P.; Andueza, M.P.; Sanchez-Bayona, R.; Guruceaga, E.; Mora, M.I.; Gurpide, A.; et al. Whole exome sequencing of germline DNA of individuals presenting extreme phenotypes of high and low risk to develop tobacco-induced lung adenocarcinoma (LUAD) according to KRAS status. J. Clin. Oncol. 2019, 37, 1540. [Google Scholar] [CrossRef]
  11. Emond, M.J.; Louie, T.; Emerson, J.; Zhao, W.; Mathias, R.A.; Knowles, M.R.; Wright, F.A.; Rieder, M.J.; Tabor, H.K.; Nickerson, D.A.; et al. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nat. Genet. 2012, 44, 886–889. [Google Scholar] [CrossRef] [PubMed]
  12. Li, Y.; Levran, O.; Kim, J.; Zhang, T.; Chen, X.; Suo, C. Extreme sampling design in genetic association mapping of quantitative trait loci using balanced and unbalanced case-control samples. Sci. Rep. 2019, 9, l15504. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Xu, C.; Fang, J.; Shen, H.; Wang, Y.P.; Deng, H.W. EPS-LASSO: Test for high-dimensional regression under extreme phenotype sampling of continuous traits. Bioinformatics 2018, 34, 1996–2003. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Kang, G.; Bi, W.; Zhao, Y.; Zhang, J.F.; Yang, J.J.; Xu, H.; Loh, M.L.; Hunger, S.P.; Relling, M.V.P.S.; Cheng, C. A new system identification approach to identify genetic variants in sequencing studies for a binary phenotype. Hum. Hered. 2014, 78, 104–116. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Liyanage, J.S.S.; Estepp, J.H.; Srivastava, K.; Li, Y.; Mori, M.; Kang, G. GMEPS: A fast and efficient likelihood approach for genome-wide mediation analysis under extreme phenotype sequencing. Stat. Appl. Genet. Mol. Biol. 2022, 21. [Google Scholar] [CrossRef]
  16. Bi, W.; Li, Y.; Smeltzer, M.P.; Gao, G.; Zhao, S.; Kang, G. STEPS: An efficient prospective likelihood approach to genetic association analyses of secondary traits in extreme phenotype sequencing. Biostatistics 2020, 21, 33–49. [Google Scholar] [CrossRef]
  17. Angrist, J.D.; Imbens, G.W. Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity. J. Am. Stat. Assoc. 1995, 90, 431–442. [Google Scholar] [CrossRef]
  18. Anderson, T. Origins of the limited information maximum likelihood and two-stage least squares estimators. J. Econom. 2005, 127, 1–16. [Google Scholar] [CrossRef]
  19. Hayashi, F. Econometrics; Princeton University Press: Princeton, NJ, USA, 2000. [Google Scholar]
  20. Anderson, T.W.; Kunitomo, N.; Sawa, T. Evaluation of the Distribution Function of the Limited Information Maximum Likelihood Estimator. Econometrica 1982, 50, 1009–1027. [Google Scholar] [CrossRef] [Green Version]
  21. Hankins, J.; Estepp, J.; Hodges, J.; Villavicencio, M.; Robison, L.; Weiss, M.; Kang, G.; Schreiber, J.; Porter, J.; Kaste, S.; et al. Sickle Cell Clinical Research and Intervention Program (SCCRIP): A lifespan cohort study for sickle cell disease progression from the pediatric stage into adulthood. Pediatr. Blood Cancer 2018, 65, e27228. [Google Scholar] [CrossRef]
  22. Rampersaud, E.; Kang, G.; Palmer, L.E.; Rashkin, S.R.; Wang, S.; Bi, W.; Alberts, N.M.; Anghelescu, D.; Barton, M.; Birch, K.; et al. A polygenic score for acute vaso-occlusive pain in pediatric sickle cell disease. Blood Adv. 2021, 5, 2839–2851. [Google Scholar] [CrossRef] [PubMed]
  23. Chaturvedi, S.; Bhatnagar, P.; Bean, C.; Steinberg, M.; Milton, J.; Casella, J.; Barron-Casella, E.; Arking, D.; DeBaun, M. Genome-wide association study to identify variants associated with acute severe vaso-occlusive pain in sickle cell anemia. Blood 2017, 130, 686–688. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Baiocchi, M.; Cheng, J.; Small, D.S. Instrumental variable methods for causal inference. Stat. Med. 2014, 33, 2297–2340. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Fox, J.; Nie, Z.; Byrnes, J. sem: Structural Equation Models. R Package Version 3.1-15. 2022. Available online: https://CRAN.R-project.org/package=sem.
  26. Ghalanos, A.; Theussl, S. Rsolnp: General Non-linear Optimization Using Augmented Lagrange Multiplier Method. R package version 1.16. 2015. Available online: https://CRAN.R-project.org/package=Rsolnp.
  27. Ye, Y. Interior Algorithms for Linear, Quadratic, and Linearly Constrained Non-Linear Programming. Ph.D. Thesis, Department of ESS, Stanford University, Stanford, CA, USA, 1987. [Google Scholar]
  28. Burgess, S.; Thompson, S. Use of allele scores as instrumental variables for Mendelian randomization. Int. J. Epidemiol. 2013, 42, 1134–1144. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  29. Ware, R.; de Montalembert, M.; Tshilolo, L.; Abboud, M. Sickle cell disease. Lancet 2017, 390, 311–323. [Google Scholar] [CrossRef]
  30. Gill, F.; Sleeper, L.; Weiner, S.; Brown, A.; Bellevue, R.; Grover, R.; Pegelow, C.; Vichinsky, E. Clinical events in the first decade in a cohort of infants with sickle cell disease. Cooperative Study of Sickle Cell Disease. Blood 1995, 86, 776–783. [Google Scholar] [CrossRef] [Green Version]
  31. Statius van Eps, L.; Pinedo-Veels, C.; de Vries, G.; de Koning, J. Nature of concentrating defect in sickle-cell nephropathy. Microradioangiographic studies. Lancet 1970, 1, 450–452. [Google Scholar] [CrossRef]
  32. Lester, L.; Sodt, P.; Hutcheon, N.; Arcilla, R. Cardiac abnormalities in children with sickle cell anemia. Chest 1990, 98, 1169–1174. [Google Scholar] [CrossRef]
  33. Ohene-Frempong, K.; Weiner, S.; Sleeper, L.; Miller, S.; Embury, S.; Moohr, J.; Wethers, D.; Pegelow, C.; Gill, F. Cerebrovascular accidents in sickle cell disease: Rates and risk factors. Blood 1998, 91, 288–294. [Google Scholar]
  34. McCarville, M.B.; Luo, Z.; Huang, X.; Rees, R.C.; Rogers, Z.R.; Miller, S.T.; Thompson, B.; Kalpatthi, R.; Wang, W.C.; BABY HUG Investigators. Abdominal ultrasound with scintigraphic and clinical correlates in infants with sickle cell anemia: Baseline data from the BABY HUG trial. AJR Am. J. Roentgenol. 2011, 196, 1399–1404. [Google Scholar] [CrossRef] [Green Version]
  35. Rogers, Z.R.; Wang, W.C.; Luo, Z.; Iyer, R.V.; Shalaby-Rana, E.; Dertinger, S.D.; Shulkin, B.L.; Miller, J.H.; Files, B.; Lane, P.A.; et al. Biomarkers of splenic function in infants with sickle cell anemia: Baseline data from the BABY HUG Trial. Blood 2011, 117, 2614–2617. [Google Scholar] [CrossRef] [PubMed]
  36. Ware, R.E.; Rees, R.C.; Sarnaik, S.A.; Iyer, R.V.; Alvarez, O.A.; Casella, J.F.; Shulkin, B.L.; Shalaby-Rana, E.; Strife, C.F.; Miller, J.H.; et al. Renal function in infants with sickle cell anemia: Baseline data from the BABY HUG trial. J. Pediatr. 2010, 156, 66–70. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  37. Pavlakis, S.; Rees, R.; Huang, X.; Brown, R.; Casella, J.; Iyer, R.; Kalpatthi, R.; Luden, J.; Miller, S.; Rogers, Z.; et al. Transcranial doppler ultrasonography (TCD) in infants with sickle cell anemia: Baseline data from the BABY HUG trial. Pediatr. Blood Cancer 2010, 54, 256–259. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  38. Miller, S.; Wang, W.; Iyer, R.; Rana, S.; Lane, P.; Ware, R.; Li, D.; Rees, R.; BABY-HUG Investigators. Urine concentrating ability in infants with sickle cell disease: Baseline data from the phase III trial of hydroxyurea (BABY HUG). Pediatr. Blood Cancer 2010, 54, 265–268. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Wang, W.; Pavlakis, S.; Helton, K.; McKinstry, R.; Casella, J.; Adams, R.; Rees, R.; BABY HUG Investigators. MRI abnormalities of the brain in one-year-old children with sickle cell anemia. Pediatr. Blood Cancer 2008, 51, 643–646. [Google Scholar] [CrossRef]
  40. Platt, O.; Brambilla, D.; Rosse, W.; Milner, P.; Castro, O.; Steinberg, M.; Klug, P. Mortality in sickle cell disease. Life expectancy and risk factors for early death. N. Engl. J. Med. 1994, 330, 1639–1644. [Google Scholar] [CrossRef]
  41. Danjou, F.; Zoledziewska, M.; Sidore, C.; Steri, M.; Busonero, F.; Maschio, A.; Mulas, A.; Perseu, L.; Barella, S.; Porcu, E.; et al. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat. Genet. 2015, 47, 1264–1271. [Google Scholar] [CrossRef] [Green Version]
  42. Stadhouders, R.; Aktuna, S.; Thongjuea, S.; Aghajanirefah, A.; Pourfarzad, F.; van Jcken, W.; Lenhard, B.; Rooks, H.; Best, S.; Menzel, S.; et al. HBS1L-MYB intergenic variants modulate fetal hemoglobin via long-range MYB enhancers. J. Clin. Investig. 2014, 124, 1699–1710. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  43. Stadhouders, R.; Thongjuea, S.; Andrieu-Soler, C.; Palstra, R.; Bryne, J.; Van Den Heuvel, A.; Stevens, M.; de Boer, E.; Kockx, C.; van der Sloot, A.; et al. Dynamic long-range chromatin interactions control Myb proto-oncogene transcription during erythroid development. EMBO J. 2012, 31, 986–999. [Google Scholar] [CrossRef] [Green Version]
  44. Suzuki, M.; Yamazaki, H.; Mukai, H.; Motohashi, H.; Shi, L.; Tanabe, O.; Engel, J.; Yamamoto, M. Disruption of the Hbs1l-Myb locus causes hereditary persistence of fetal hemoglobin in a mouse model. Mol. Cell. Biol. 2013, 33, 1687–1695. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  45. Yawn, B.P.; Buchanan, G.; Afenyi-Annan, A.; Ballas, S.; Hassell, K.; James, A.; Jordan, L.; Lanzkron, S.; Lottenberg, R.; Savage, W.; et al. Management of sickle cell disease: Summary of the 2014 evidence-based report by expert panel members. JAMA 2014, 312, 1033–1048. [Google Scholar] [CrossRef]
  46. Estepp, J.; Smeltzer, M.; Kang, G.; Li, C.; Wang, W.; Abrams, C.; Aygun, B.; Ware, R.; Nottage, K.; Hankins, J. A clinically meaningful fetal hemoglobin threshold for children with sickle cell anemia during hydroxyurea therapy. Am. J. Hematol. 2017, 92, 1333–1339. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  47. Lebensburger, J.; Pestina, T.; Ware, R.; Boyd, K.; Persons, D. Hydroxyurea therapy requires HbF induction for clinical benefit in a sickle cell mouse model. Haematologica 2010, 95, 1599–1603. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  48. Meier, E.; Byrnes, C.; Weissman, M.; Noel, P.; Luban, N.; Miller, J. Expression patterns of fetal hemoglobin in sickle cell erythrocytes are both patient- and treatment-specific during childhood. Pediatr. Blood Cancer 2011, 56, 103–109. [Google Scholar] [CrossRef] [Green Version]
  49. Das, S.; Forer, L.; Schönherr, S.; Sidore, C.; Locke, A.; Kwong, A.; Vrieze, S.; Chew, E.; Levy, S.; McGue, M.; et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016, 48, 1284–1287. [Google Scholar] [CrossRef] [Green Version]
  50. Gardner, K.; Fulford, T.; Silver, N.; Rooks, H.; Angelis, N.; Allman, M.; Nkya, S.; Makani, J.; Howard, J.; Kesse-Adu, R.; et al. g(HbF): A genetic model of fetal hemoglobin in sickle cell disease. Blood Adv. 2018, 2, 235–239. [Google Scholar] [CrossRef] [PubMed]
  51. Bowden, J.; Davey, S.G.; Burgess, S. Mendelian randomization with invalid instruments: Effect estimation and bias detection through egger regression. Int. J. Epidemiol. 2015, 44, 512–525. [Google Scholar] [CrossRef] [Green Version]
  52. Sargan, J.D. The Estimation of Economic Relationships using Instrumental Variables. Econometrica 1958, 26, 393–415. [Google Scholar] [CrossRef]
  53. Burgess, S.; Thompson, S.G. Mendelian Randomization: Methods for Causal Inference Using Genetic Variants; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021. [Google Scholar]
  54. Bhanushali, A.A.; Patra, P.; Nair, D.; Verma, H.; Das, B. Genetic variant in the BCL11A (rs1427407), but not HBS1-MYB (rs6934903) loci associate with fetal hemoglobin levels in Indian sickle cell disease patients. Blood Cells Mol. Dis. 2015, 54, 4–8. [Google Scholar] [CrossRef]
  55. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. 1995, 57, 289–300. [Google Scholar] [CrossRef]
  56. Steinberg, M.; Sebastiani, P. Genetic modifiers of sickle cell disease. Am. J. Hematol. 2012, 87, 795–803. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (Top) panel: comparison of estimators obtained via MREPS and 2SLS under the random (left), EPSY (middle) and EPSX (right) designs. (Bottom) panel: rejection proportion of H 0 comparison for MREPS and 2SLS under the random (left), EPSY (middle) and EPSX (right) designs.
Figure 1. (Top) panel: comparison of estimators obtained via MREPS and 2SLS under the random (left), EPSY (middle) and EPSX (right) designs. (Bottom) panel: rejection proportion of H 0 comparison for MREPS and 2SLS under the random (left), EPSY (middle) and EPSX (right) designs.
Mathematics 10 03743 g001
Figure 2. (Left): the behavior of the estimators due to the change of sample size, obtained through MREPS and 2SLS under the random design. (Middle and right): the behavior of the estimators, due to the change of sample size or the proportion of the extreme tails, obtained through MREPS and 2SLS under the EPSY and EPSX, respectively.
Figure 2. (Left): the behavior of the estimators due to the change of sample size, obtained through MREPS and 2SLS under the random design. (Middle and right): the behavior of the estimators, due to the change of sample size or the proportion of the extreme tails, obtained through MREPS and 2SLS under the EPSY and EPSX, respectively.
Mathematics 10 03743 g002
Figure 3. HbF induced by B C L 11 A , H B S 1 L M Y B , and the extended β -globin locus causally influences Hb in SCCRIP/BCM (left) and SIT (right) patients. (Top) panels: X-axis: Genetic association with HbF (%). Y-axis: Genetic association with Hb (g/dL). The genetic associations were done using linear regression models, adjusting for the covariates age, sex, sickle diagnosis (Yes/No), and top 5 PCs. (Bottom) panels: X-axis: Genetic association with HbF (%); Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). The Mendelian randomization analysis was done using MREPS. SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.
Figure 3. HbF induced by B C L 11 A , H B S 1 L M Y B , and the extended β -globin locus causally influences Hb in SCCRIP/BCM (left) and SIT (right) patients. (Top) panels: X-axis: Genetic association with HbF (%). Y-axis: Genetic association with Hb (g/dL). The genetic associations were done using linear regression models, adjusting for the covariates age, sex, sickle diagnosis (Yes/No), and top 5 PCs. (Bottom) panels: X-axis: Genetic association with HbF (%); Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). The Mendelian randomization analysis was done using MREPS. SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.
Mathematics 10 03743 g003
Figure 4. Top left and right panels: comparison between the effects of PGSHbF on HbF and Hb and correlation between HbF and Hb in SCCRIP/BCM and SIT combined cohort and EPSY derived from SCCRIP/BCM and SIT combined cohort, respectively. The p value of the top right plot was calculated using STEPS [16], and the p values of the other plots were calculated using linear regression model, adjusting for covariates of cubic-spline–defined age, sex, sickle diagnosis (Yes/No), and top 5 PCs. Bottom panel: HbF induced by B C L 11 A and H B S 1 L M Y B causally influences Hb in SCCRIP/BCM and SIT combined (left) and EPSY (right) patients. X-axis: Genetic association with HbF (%). Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.
Figure 4. Top left and right panels: comparison between the effects of PGSHbF on HbF and Hb and correlation between HbF and Hb in SCCRIP/BCM and SIT combined cohort and EPSY derived from SCCRIP/BCM and SIT combined cohort, respectively. The p value of the top right plot was calculated using STEPS [16], and the p values of the other plots were calculated using linear regression model, adjusting for covariates of cubic-spline–defined age, sex, sickle diagnosis (Yes/No), and top 5 PCs. Bottom panel: HbF induced by B C L 11 A and H B S 1 L M Y B causally influences Hb in SCCRIP/BCM and SIT combined (left) and EPSY (right) patients. X-axis: Genetic association with HbF (%). Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.
Mathematics 10 03743 g004
Table 1. Average estimators, average standard errors, bias percentages, rejection proportions, and coverage percentages obtained through MREPS and 2SLS under EPSY design.
Table 1. Average estimators, average standard errors, bias percentages, rejection proportions, and coverage percentages obtained through MREPS and 2SLS under EPSY design.
Mean β ^ 2 Mean SE Bias % Rejection Proportion CP
Number of SNPs β 2 MREPS2SLS MREPS2SLS MREPS2SLS MREPS2SLS MREPS2SLS
(a) Equal-sized IV—risk factor associations
90 −0.004−0.02 0.070.3 -- 0.060.05 94.695.1
0.2 0.210.54 0.080.17 −3.19−169.66 0.770.86 93.542.1
0.4 0.410.82 0.090.1 −2.02−105.3 0.991.0 94.55.9
250 −0.01−0.02 0.111.13 -- 0.050.05 95.394.7
0.2 0.190.51 0.120.28 4.84−155.91 0.480.64 9562.7
0.4 0.410.82 0.130.16 −2.6−104.03 0.890.97 94.625.7
(b) Different-sized IV—risk factor associations
90 −0.005−0.04 0.070.28 -- 0.040.03 96.296.8
0.2 0.20.53 0.070.16 −1.02−166.89 0.790.88 94.339.1
0.4 0.40.82 0.080.1 −0.99−104.72 1.01.0 94.24.2
250 −0.01−0.11 0.120.58 -- 0.040.04 95.795.7
0.2 0.180.51 0.120.28 8.28−153.52 0.460.61 95.461.4
0.4 0.410.81 0.130.17 −1.96−102.81 0.860.96 94.827.8
(c) A combination of few large and many small IV - risk factor associations
90 −0.0006−0.01 0.050.21 -- 0.060.05 94.195.1
0.2 0.20.54 0.060.12 −1.45−170.85 0.940.97 93.220.1
0.4 0.410.82 0.060.07 −1.34−105.3 1.01.0 930.5
250 −0.0002−0.01 0.050.18 -- 0.050.06 94.794.5
0.2 0.20.54 0.050.1 −1.15−171.2 0.970.99 93.312.4
0.4 0.40.82 0.060.06 −1.1−104.93 1.01.0 93.10.1
(d) A combination of valid and invalid IV - risk factor associations
90 −0.002−0.02 0.050.19 -- 0.060.05 93.995.2
0.2 0.20.54 0.050.11 −1.42−170.38 0.950.97 91.518.7
0.4 0.40.82 0.060.07 −1.04−104.89 1.01.0 930.2
250 −0.005−0.03 0.080.31 -- 0.050.05 94.995.5
0.2 0.210.54 0.080.17 −4.16−170.31 0.750.84 93.443.8
0.4 0.410.82 0.090.11 −2.91−106.12 0.991.0 93.96.7
SE = Standard Error, CP = Coverage Percentage.
Table 2. MR analyses to assessing the causal effect of HbF on Hb of patients from the two cohorts of SCCRIP/BCM and SIT. Instrumental variables are PGSHbF and r s 1427407 .
Table 2. MR analyses to assessing the causal effect of HbF on Hb of patients from the two cohorts of SCCRIP/BCM and SIT. Instrumental variables are PGSHbF and r s 1427407 .
β ^ 2 SE ( β ^ 2 )p-Value ( β 2 ) γ ^ 2 SE ( γ ^ 2 )p-Value ( γ 2 )
Cohort/DesignIVMREPS2SLSMREPS2SLSMREPS2SLSMREPS2SLSMREPS2SLSMREPS2SLS
SCCRIP/BCMPGSHbF0.10.10.0130.013 1 × 10 16 1 × 10 14 1.6-0.083-0-
r s 1427407 0.0960.0950.0160.016 7 × 10 10 1 × 10 8 4.5-0.3-0-
SITPGSHbF0.150.150.0140.0230 3 × 10 10 1.3-0.14-0-
r s 1427407 0.150.150.0240.027 7 × 10 10 1 × 10 7 3.3-0.52- 2 × 10 10 -
CombinedPGSHbF0.110.110.010.01001.4-0.076-0-
r s 1427407 0.110.110.0110.012004-0.27-0-
EPSYPGSHbF0.110.110.0150.016 2 × 10 13 3 × 10 12 1.4-0.13-0-
r s 1427407 0.120.120.0170.019 2 × 10 12 1 × 10 9 3.8-0.44-0-
EPSXPGSHbF0.0980.110.0130.018 1 × 10 14 5 × 10 10 1.7-0.18-0-
r s 1427407 0.0950.110.0130.019 4 × 10 13 3 × 10 8 5.4-0.67- 7 × 10 16 -
SE = Standard Error; SCCRIP/BCM = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion study cohort; HbF = total fetal hemoglobin; PGSHbF = Polygenic Score of HbF.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Liyanage, J.S.S.; Estepp, J.H.; Srivastava, K.; Rashkin, S.R.; Sheehan, V.A.; Hankins, J.S.; Takemoto, C.M.; Li, Y.; Cui, Y.; Mori, M.; et al. A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics 2022, 10, 3743. https://doi.org/10.3390/math10203743

AMA Style

Liyanage JSS, Estepp JH, Srivastava K, Rashkin SR, Sheehan VA, Hankins JS, Takemoto CM, Li Y, Cui Y, Mori M, et al. A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics. 2022; 10(20):3743. https://doi.org/10.3390/math10203743

Chicago/Turabian Style

Liyanage, Janaka S. S., Jeremie H. Estepp, Kumar Srivastava, Sara R. Rashkin, Vivien A. Sheehan, Jane S. Hankins, Clifford M. Takemoto, Yun Li, Yuehua Cui, Motomi Mori, and et al. 2022. "A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia" Mathematics 10, no. 20: 3743. https://doi.org/10.3390/math10203743

APA Style

Liyanage, J. S. S., Estepp, J. H., Srivastava, K., Rashkin, S. R., Sheehan, V. A., Hankins, J. S., Takemoto, C. M., Li, Y., Cui, Y., Mori, M., Burgess, S., DeBaun, M. R., & Kang, G. (2022). A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics, 10(20), 3743. https://doi.org/10.3390/math10203743

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop