Find the Needle in the Haystack, Then Find It Again: Replication and Validation in the ‘Omics Era
Abstract
“The natural scientist is concerned with a particular kind of phenomenon… he has to confine himself to that which is reproducible… I do not claim that the reproducible by itself is more important than the unique. But I do claim that the unique exceeds the treatment by scientific method.” (Wolfgang Pauli)
1. Terminology
1.1. Replication
1.2. Validation
2. A Deeper Dive into Validation
2.1. Selecting Features for Validation
2.1.1. Correcting for Multiple Comparisons
2.1.2. Metrics for Selection of Findings to Follow-Up Studies
2.1.3. Validation beyond Single Features
2.2. Selecting a Population for Validation
2.3. Interpreting Findings
2.4. Metadata
3. Summary of Considerations for Metabolomics
4. Future Directions: Integration of Multiple ‘Omics
Author Contributions
Funding
Conflicts of Interest
Both
● A priori calculations of the sample size required to detect a realistic effect.
● Use of publicly available ‘omics datasets and/or pursuit of collaborations with other cohorts/consortia to maximize statistical power.
● Stringent and appropriate corrections for multiple testing.
● Transparent reporting of all relevant methods (from the laboratory work to bioinformatics pipelines, to data cleaning, to formal data analysis), features, and results (including those that failed to establish reproducibility).
Validation
● Harmonizing data across the discovery and validation stages to reduce the likelihood of non-reproducible findings due to systemic differences.
● Inclusion of diverse datasets in validation efforts and the use of appropriate statistical methods to account for the resulting heterogeneity.
● Judicious incorporation of functional annotations and effect sizes (in addition to statistical significance) when selecting features for validation and interpreting findings.
● Distinguishing between reproducibility, functional relevance, and predictive validity, and using the appropriate metrics for each.
Replication
● Original (discovery) and confirmatory (replication) populations should be similar in terms of sex, age, and race/ethnic distributions.
● Use of identical laboratory procedures, data processing pipelines, and analytical approaches.
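As a brief illustration of the multiple-testing consideration listed above, the following is a minimal sketch, not a prescribed pipeline, of two widely used corrections: the Bonferroni adjustment, which controls the family-wise error rate, and the Benjamini-Hochberg step-up procedure, which controls the false discovery rate. The function names, the significance level of 0.05, and the p-values are hypothetical and chosen only for demonstration; they are not taken from any study discussed here.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for ten features (illustrative only).
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni rejections:", sum(bonferroni(p)))                   # 1
    print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p)))   # 2
```

With these illustrative values, the Bonferroni threshold retains one feature while the Benjamini-Hochberg procedure retains two, which shows the usual trade-off: a stricter correction carries fewer features forward to replication or validation, at the cost of potentially missing true associations.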
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Perng, W.; Aslibekyan, S. Find the Needle in the Haystack, Then Find It Again: Replication and Validation in the ‘Omics Era. Metabolites 2020, 10, 286. https://doi.org/10.3390/metabo10070286