A Two-Stage Mutual Information Based Bayesian Lasso Algorithm for Multi-Locus Genome-Wide Association Studies
Abstract
1. Introduction
2. Materials and Methods
2.1. Statistical Framework
2.2. Simulation Experiments
2.3. Real Data and Preprocessing
2.4. Mutual Information
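Steps 5 and 6 of the method in Section 2.7 rank SNPs by their mutual information with the corrected phenotype. The following is a minimal sketch of one way such a quantity could be estimated, assuming discrete genotype codes and a continuous phenotype that is first cut into equal-frequency bins. The function name `mutual_information`, the binning scheme, and the `n_bins` parameter are illustrative assumptions and are not taken from the article.

```python
import numpy as np

def mutual_information(snp, phenotype, n_bins=5):
    """Plug-in estimate of I(SNP; phenotype) in nats.

    snp       : 1-D array of discrete genotype codes (e.g. 0/1/2).
    phenotype : 1-D array of continuous trait values.
    n_bins    : hypothetical choice; the phenotype is cut into
                equal-frequency bins before counting.
    """
    snp = np.asarray(snp).ravel()
    phenotype = np.asarray(phenotype, dtype=float).ravel()

    # Discretize the phenotype at its interior empirical quantiles.
    edges = np.quantile(phenotype, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_bin = np.digitize(phenotype, edges)

    # Joint contingency table: genotype class x phenotype bin.
    _, x_idx = np.unique(snp, return_inverse=True)
    _, y_idx = np.unique(y_bin, return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()

    # I(X;Y) = sum_xy p(x,y) * log[p(x,y) / (p(x) p(y))], skipping empty cells.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))
```

Unlike the Pearson correlation, this estimate does not assume a linear relationship between genotype and trait, which is the usual motivation for using it as a second screening criterion. The plug-in estimator is biased upward in small samples, so the bin count should be kept small relative to the sample size.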
2.5. SCAD
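For reference, the SCAD (smoothly clipped absolute deviation) penalty of Fan and Li (2001), which is used after each screening step in Section 2.7, has the standard form below. The symbols $\theta$ (a single coefficient), $\lambda > 0$ (the tuning parameter) and $a > 2$ (the shape parameter) follow the original SCAD paper and may differ from the notation used in this article.

```latex
p_{\lambda}(\theta) =
\begin{cases}
\lambda\,|\theta|, & |\theta| \le \lambda,\\[4pt]
\dfrac{2a\lambda\,|\theta| - \theta^{2} - \lambda^{2}}{2(a-1)}, & \lambda < |\theta| \le a\lambda,\\[8pt]
\dfrac{(a+1)\,\lambda^{2}}{2}, & |\theta| > a\lambda.
\end{cases}
```

The penalty behaves like the lasso near zero and becomes constant for large coefficients, so small effects are shrunk to exactly zero while large effects are left nearly unpenalized. Fan and Li suggest $a = 3.7$ as a default.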
2.6. Likelihood Ratio Test
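Step 10 of the method in Section 2.7 applies a likelihood ratio test to the SNPs that survive the shrinkage step. The sketch below shows the generic test for a single SNP in a Gaussian linear model, assuming ordinary least squares fits of a reduced model (without the SNP) and a full model (with it). The function name `lrt_single_snp` and its arguments are hypothetical, and the LOD conversion is included only as a common convention in QTL mapping; no significance threshold is implied.

```python
import math
import numpy as np

def lrt_single_snp(y, X_reduced, snp):
    """Likelihood ratio test for one SNP in a Gaussian linear model.

    y         : (n,) phenotype vector.
    X_reduced : (n, q) design matrix of the reduced model
                (intercept and any other covariates).
    snp       : (n,) genotype codes of the SNP under test.
    Returns (LR statistic, chi-square(1) p-value, LOD score).
    """
    y = np.asarray(y, dtype=float)
    X0 = np.asarray(X_reduced, dtype=float)
    X1 = np.column_stack([X0, np.asarray(snp, dtype=float)])

    def rss(X):
        # Residual sum of squares of the least-squares fit of y on X.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    n = y.size
    # With the ML variance estimate RSS/n, -2*log(L0/L1) = n*log(RSS0/RSS1).
    lr = max(n * math.log(rss(X0) / rss(X1)), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    lod = lr / (2.0 * math.log(10.0))
    return lr, p_value, lod
```

The closed form for the p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable, which avoids a dependency on SciPy.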
2.7. A Two-Stage Mutual Information Based Bayesian Lasso (MBLASSO) Method
- Step 1: Correct the initial phenotype vector () by the fixed effects, which indicate the population structure in our model.
- Step 2: Calculate the Pearson correlation of the ith SNP with the corrected phenotype (), that is,
- Step 3: Sort the components of vector in descending order and define a subset:
- Step 4: Apply ISIS-SCAD [11] to recover non-negligible SNPs that are marginally uncorrelated but jointly correlated with the phenotype; only one iteration is performed here. First, correct the phenotype in Step 1 () by the SNPs selected by SIS-SCAD in Step 3, that is,
- Step 5: Under the same conditions as in Step 2, calculate the mutual information of the ith SNP and the corrected phenotype () by
- Step 6: Similar to Step 3, sort the components of vector in descending order and define another subset: Assume that SNPs corresponding to , , because more than one SNP may share the same mutual information value with the phenotype. The subset is . Then use SCAD to estimate the effects of SNPs in and select the SNPs with nonzero effect to constitute a new subset , , and . The SNPs in correspond to Type I in mutual information screening. We refer to this mutual information based SIS followed by SCAD as MI-SIS-SCAD.
- Step 7: As in Step 4, correct the phenotype in Step 1 () by the SNPs selected by MI-SIS-SCAD, and repeat MI-SIS-SCAD once on the remaining SNPs, which generates a subset of SNPs, . The SNPs in correspond to Type II in mutual information screening. The union of the disjoint subsets and is denoted as , , the size of which is , . We refer to this process as MI-ISIS-SCAD.
- Step 8: Pool the SNPs selected in Steps 4 and 7 and remove duplicates. This yields a new subset of SNPs, that is, , the size of which is .
- Step 9: Use EM-BLASSO to estimate the effects of the SNPs from and further eliminate the SNPs with zero effect. The source code for EM-BLASSO is available at https://CRAN.R-project.org/package=mrMLM, where the ISIS EM-BLASSO program can also be downloaded. Note that the phenotype vector in this step is the original one ().
- Step 10: Apply the likelihood ratio test to identify the true QTNs, and set the significance criterion as .
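Steps 1 to 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the phenotype is corrected by taking least-squares residuals on a fixed-effects matrix `W`, SNPs are ranked by the absolute value of their Pearson correlation with the corrected phenotype, and the `d` top-ranked SNPs are retained. The function names, the use of the absolute value, and the subset size `d` are all assumptions introduced here.

```python
import numpy as np

def correct_phenotype(y, W):
    """Residuals of y after least-squares regression on fixed effects W (n x q)."""
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)
    alpha, *_ = np.linalg.lstsq(W, y, rcond=None)
    return y - W @ alpha

def pearson_screen(Z, y_corrected, d):
    """Rank SNPs (columns of Z) by |Pearson r| with the corrected phenotype.

    Returns the indices of the d top-ranked SNPs and the full vector of
    correlations. d is a hypothetical screening size chosen by the caller.
    """
    Z = np.asarray(Z, dtype=float)
    y_corrected = np.asarray(y_corrected, dtype=float)
    zc = Z - Z.mean(axis=0)
    yc = y_corrected - y_corrected.mean()
    denom = np.sqrt((zc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Monomorphic SNPs have zero variance; assign them zero correlation.
    r = np.divide(zc.T @ yc, denom, out=np.zeros(Z.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))
    return order[:d], r

# Usage on synthetic data (all values are made up for illustration).
rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(100, 50)).astype(float)    # 100 lines, 50 SNPs
W = np.ones((100, 1))                                   # intercept only
y = 1.0 + 3.0 * Z[:, 7] + rng.normal(scale=0.5, size=100)
selected, r = pearson_screen(Z, correct_phenotype(y, W), d=5)
```

The mutual information screening in Steps 5 and 6 follows the same pattern, with the correlation replaced by a mutual information estimate and the phenotype corrected additionally for the SNPs already selected.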
3. Results
3.1. The Overlap Ratio between Pearson Correlation and Mutual Information Based Screening in MBLASSO
3.2. Statistical Power for QTN Detection
3.3. Average Accuracy for QTN Effects
3.4. Type 1 Error Ratio
3.5. Computational Efficiency
3.6. Arabidopsis thaliana Dataset Analysis
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
References
- Yu, J.; Pressoir, G.; Briggs, W.H.; Bi, I.V.; Yamasaki, M.; Doebley, J.F.; McMullen, M.D.; Gaut, B.S.; Nielsen, D.M.; Holland, J.B. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006, 38, 203–208.
- Kang, H.M.; Zaitlen, N.A.; Wade, C.M.; Kirby, A.; Heckerman, D.; Daly, M.J.; Eskin, E. Efficient control of population structure in model organism association mapping. Genetics 2008, 178, 1709–1723.
- Zhang, Z.; Ersoz, E.; Lai, C.Q.; Todhunter, R.J.; Tiwari, H.K.; Gore, M.A.; Bradbury, P.J.; Yu, J.; Arnett, D.K.; Ordovas, J.M.; et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 2010, 42, 355–360.
- Lippert, C.; Listgarten, J.; Liu, Y.; Kadie, C.M.; Davidson, R.I.; Heckerman, D. FaST linear mixed models for genome-wide association studies. Nat. Methods 2011, 8, 833–835.
- Zhou, X.; Stephens, M. Genome-wide efficient mixed model analysis for association studies. Nat. Genet. 2012, 44, 821–824.
- Tamba, C.L.; Ni, Y.L.; Zhang, Y.M. Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies. PLoS Comput. Biol. 2017, 13, e1005357.
- Wu, T.T.; Chen, Y.F.; Hastie, T.; Sobel, E.; Lange, K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25, 714–721.
- Cho, S.; Kim, H.; Oh, S.; Kim, K.; Taesung, P. Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis. BMC Proc. 2009, 3, S25.
- Li, J.; Das, K.; Fu, G.; Li, R.; Wu, R. The Bayesian lasso for genome-wide association studies. Bioinformatics 2011, 27, 516–523.
- Xu, S. An expectation-maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity 2010, 105, 483–494.
- Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 2008, 70, 849–911.
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
- Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
- Li, G.; Peng, H.; Zhang, J.; Zhu, L. Robust rank correlation based screening. Ann. Stat. 2012, 40, 1846–1877.
- Li, R.; Zhong, W.; Zhu, L. Feature screening via distance correlation learning. J. Am. Stat. Assoc. 2012, 107, 1129–1139.
- Li, R.; Liu, J.; Lou, L. Variable selection via partial correlation. Statistica Sinica 2017, 27, 983–996.
- Jiang, L.; Liu, J.; Zhu, X.; Ye, M.; Sun, L.; Lacaze, X.; Wu, R. 2HiGWAS: A unifying high-dimensional platform to infer the global genetic architecture of trait development. Brief. Bioinform. 2015, 16, 905–911.
- Cui, Y.; Zhang, F.; Zhou, Y. The application of multi-locus GWAS for the detection of salt-tolerance loci in rice. Front. Plant Sci. 2018, 9, 1464.
- Liu, J.; Ye, M.; Zhu, S.; Jiang, L.; Sang, M.; Gan, J.; Wang, Q.; Huang, M.; Wu, R. Two-stage identification of SNP effects on dynamic poplar growth. Plant J. 2018, 93, 286–296.
- Fan, J.; Han, F.; Liu, H. Challenges of big data analysis. Nat. Sci. Rev. 2014, 1, 293–314.
- Jing, P.J.; Shen, H.B. MACOED: A multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics 2015, 31, 634–641.
- Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; Mcvean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti, P.C. Detecting novel associations in large data sets. Science 2011, 334, 1518–1524.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Atwell, S.; Huang, Y.S.; Vilhjalmsson, B.J.; Willems, G.; Horton, M.; Li, Y.; Meng, D. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 2010, 465, 627–631.
- Wang, S.B.; Feng, J.Y.; Ren, W.L.; Huang, B.; Zhou, L.; Wen, Y.J.; Zhang, J.; Dunwell, J.M.; Xu, S.; Zhang, Y.M. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci. Rep. 2016, 6, 19444.
- Togninalli, M.; Seren, Ü.; Freudenthal, J.A.; Monroe, J.G.; Meng, D.; Nordborg, M.; Weigel, D.; Borgwardt, K.; Korte, A.; Grimm, D.G. AraPheno and the AraGWAS Catalog 2020: A major database update including RNA-Seq and knockout mutation data for Arabidopsis thaliana. Nucleic Acids Res. 2019, 48, D1063–D1068.
- Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.R.; Bender, D.; Maller, J.; Sklar, P.; Bakker, P.I.W.D.; Daly, M.J. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
- Alexander, D.H.; Novembre, J.; Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009, 19, 1655–1664.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Ren, W.L.; Wen, Y.J.; Dunwell, J.M.; Zhang, Y.M. pKWmEB: Integration of Kruskal-Wallis test with empirical Bayes under polygenic background control for multi-locus genome-wide association study. Heredity 2018, 120, 208–218.
- Berardini, T.Z.; Mundodi, S.; Reiser, L.; Huala, E.; Garcia-Hernandez, M.; Zhang, P.; Mueller, L.A.; Yoon, J.; Doyle, A.; Lander, G.; et al. Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 2004, 135, 745–755.
- Zhang, J.; Feng, J.Y.; Ni, Y.L.; Wen, Y.J.; Niu, Y.; Tamba, C.L.; Yue, C.; Song, Q.; Zhang, Y.M. pLARmEB: Integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity 2017, 118, 517–524.
Simulations | Pearson Correlation Screening: Type I | Pearson Correlation Screening: Type II | Pearson Correlation Screening: Total | Mutual Information Screening: Type I | Mutual Information Screening: Type II | Mutual Information Screening: Total |
---|---|---|---|---|---|---|
1 | 0.470 (15.8) | 0.086 (50.4) | 0.184 (66.2) | 0.417 (18.2) | 0.298 (15.5) | 0.356 (33.7) |
2 | 0.452 (16.6) | 0.091 (50.3) | 0.181 (66.9) | 0.398 (19.0) | 0.285 (17.5) | 0.334 (36.5) |
3 | 0.457 (14.6) | 0.090 (50.8) | 0.173 (65.4) | 0.383 (18.4) | 0.278 (17.4) | 0.323 (35.8) |
Traits | MBLASSO: AIC | MBLASSO: BIC | ISIS EM-BLASSO: AIC | ISIS EM-BLASSO: BIC | GEMMA: AIC | GEMMA: BIC | EM-BLASSO: AIC | EM-BLASSO: BIC |
---|---|---|---|---|---|---|---|---|
LDV | −360.543 | −307.436 | −318.966 | −275.230 | 1312.693 | 1322.065 | −113.638 | −104.266 |
SDV | −169.269 | −114.028 | −140.485 | −85.245 | 1356.907 | 1372.251 | 149.095 | 149.095 |
2W | −103.363 | −51.957 | −65.172 | −7.718 | 584.000 | 587.024 | 148.247 | 160.342 |
4W | −124.109 | −74.084 | −98.993 | −54.527 | 1253.281 | 1258.839 | 22.893 | 39.568 |
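For reading the AIC and BIC columns above, the standard definitions of the two criteria are recalled here. The symbols are generic and are assumptions of this note: $k$ is the number of estimated parameters, $n$ is the sample size, and $\hat{L}$ is the maximized likelihood of the fitted model. The article may use a different but equivalent parameterization.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}.
```

For both criteria, a lower value indicates a better trade-off between goodness of fit and model complexity. BIC penalizes each additional parameter more heavily than AIC whenever $\ln n > 2$, that is, for $n \ge 8$.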
Traits | MBLASSO | ISIS EM-BLASSO | GEMMA | EM-BLASSO |
---|---|---|---|---|
LDV | 5/17 | 3/14 | 0/3 | 0/3 |
SDV | 4/18 | 2/18 | 1/5 | 0/0 |
2W | 2/17 | 1/19 | 0/1 | 0/4 |
4W | 3/18 | 2/16 | 1/2 | 0/6 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Guo, H.; Yu, Z.; An, J.; Han, G.; Ma, Y.; Tang, R. A Two-Stage Mutual Information Based Bayesian Lasso Algorithm for Multi-Locus Genome-Wide Association Studies. Entropy 2020, 22, 329. https://doi.org/10.3390/e22030329