Statistical Methods for Genetic Epidemiology

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Molecular Genetics and Genomics".

Deadline for manuscript submissions: 15 December 2024 | Viewed by 21130

Special Issue Editors


Dr. Chenglong Yu
Guest Editor
Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
Interests: statistical genetics; population genetics; bioinformatics; psychiatric disorder; cancer epidemiology; DNA methylation; molecular phylogenetics

Dr. Shuo Chen
Guest Editor
Division of Biostatistics and Bioinformatics and Maryland Psychiatric Research Center, School of Medicine, University of Maryland, Baltimore, MD, USA
Interests: biostatistics; imaging genetics; neuropsychiatric disorder; network analysis

Dr. Shuai Li
Guest Editor
Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010, Australia
Interests: cancer epidemiology; genetic epidemiology; epigenetic epidemiology; cancer risk modeling; twin and family research

Dr. Hsin-Hsiung Huang
Guest Editor
Department of Statistics and Data Science, University of Central Florida, Orlando, FL 32816, USA
Interests: big data; machine learning; regularized low-rank matrix models; genomics modeling and analysis; Bayesian ultra-high dimensional variable selection and clustering; spatiotemporal models

Special Issue Information

Dear Colleagues,

Genetic epidemiology, an important area of public health research, has rapidly evolved in the last two decades. This field of study seeks to understand the contribution of genetic factors to health and disease in families and populations and the interplay between genes and environmental factors. Recent advances in high-throughput genomic profiling techniques have brought a sharp increase in “omics” data (genomics, proteomics, transcriptomics, epigenomics, metabolomics, metagenomics, single-cell, etc.) and have accelerated the development of the knowledge and methodologies used to gain a better understanding of the multifactorial causes, distribution, and prediction of inherited diseases in populations.

This Special Issue aims to highlight the latest advances in statistical methods in genetic epidemiology. We encourage researchers to share original research that develops novel statistical, bioinformatic, and computational approaches or applies advanced statistical techniques to complex traits or diseases. Review papers addressing current advances in this field are also welcome. Topics of primary interest include, but are not limited to:

  • Family and twin studies;
  • Genome-wide association studies;
  • Population genetics;
  • Heritability and genetic correlation;
  • Polygenic risk scores;
  • Gene–environment interaction;
  • Multi-omics studies;
  • Imaging genetics;
  • Expression quantitative trait loci (eQTLs);
  • Mendelian randomization;
  • Epigenetic epidemiology;
  • Single-cell epidemiology.

Dr. Chenglong Yu
Dr. Shuo Chen
Dr. Shuai Li
Dr. Hsin-Hsiung Huang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • family and twin study
  • GWAS
  • heritability
  • polygenic risk score
  • multi-omics
  • imaging genetics
  • eQTL
  • gene–environment interaction
  • epigenetics
  • Mendelian randomization

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies is available from MDPI.

Published Papers (10 papers)

Research

13 pages, 1081 KiB  
Article
A New Method for Conditional Gene-Based Analysis Effectively Accounts for the Regional Polygenic Background
by Gulnara R. Svishcheva, Nadezhda M. Belonogova, Anatoly V. Kirichenko, Yakov A. Tsepilov and Tatiana I. Axenovich
Genes 2024, 15(9), 1174; https://doi.org/10.3390/genes15091174 - 7 Sep 2024
Viewed by 695
Abstract
Gene-based association analysis is a powerful tool for identifying genes that explain trait variability. An essential step of this analysis is a conditional analysis. It aims to eliminate the influence of SNPs outside the gene, which are in linkage disequilibrium with intragenic SNPs. The popular conditional analysis method, GCTA-COJO, accounts for the influence of several top independently associated SNPs outside the gene, correcting the z statistics for intragenic SNPs. We suggest a new TauCOR method for conditional gene-based analysis using summary statistics. This method accounts for the influence of the full regional polygenic background, correcting the genotype correlations between intragenic SNPs. As a result, the distribution of z statistics for intragenic SNPs becomes conditionally independent of the distribution for extragenic SNPs. TauCOR is compatible with any gene-based association test. TauCOR was tested on summary statistics simulated under different scenarios and on real summary statistics for a ‘gold standard’ gene list from the Open Targets Genetics project. TauCOR proved to be effective in all modelling scenarios and on real data. The TauCOR strategy showed comparable sensitivity and higher specificity and accuracy than GCTA-COJO on both simulated and real data. The method can be successfully used to improve the effectiveness of gene-based association analyses.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

12 pages, 1000 KiB  
Article
New Virus Variant Detection Based on the Optimal Natural Metric
by Hongyu Yu and Stephen S.-T. Yau
Genes 2024, 15(7), 891; https://doi.org/10.3390/genes15070891 - 7 Jul 2024
Viewed by 3573
Abstract
The highly variable SARS-CoV-2 virus responsible for the COVID-19 pandemic frequently undergoes mutations, leading to the emergence of new variants that present novel threats to public health. The determination of these variants often relies on manual definition based on local sequence characteristics, resulting in delays in their detection relative to their actual emergence. In this study, we propose an algorithm for the automatic identification of novel variants. By leveraging the optimal natural metric for viruses based on an alignment-free perspective to measure distances between sequences, we devise a hypothesis testing framework to determine whether a given viral sequence belongs to a novel variant. Our method demonstrates high accuracy, achieving nearly 100% precision in identifying new variants of SARS-CoV-2 and HIV-1 as well as in detecting novel genera in Orthocoronavirinae. This approach holds promise for timely surveillance and management of emerging viral threats in the field of public health.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

14 pages, 4224 KiB  
Article
A Polygenic Risk Analysis for Identifying Ulcerative Colitis Patients with European Ancestry
by Ling Liu, Yiming Wu, Yizhou Li and Menglong Li
Genes 2024, 15(6), 684; https://doi.org/10.3390/genes15060684 - 25 May 2024
Cited by 1 | Viewed by 923
Abstract
The incidence of ulcerative colitis (UC) has increased globally. As a complex disease, the genetic predisposition for UC could be estimated by the polygenic risk score (PRS), which aggregates the effects of a large number of genetic variants in a single quantity and shows promise in identifying individuals at higher lifetime risk of UC. Here, based on a cohort of 2869 UC cases and 2900 controls with genotype array datasets, we used PRSice-2 to calculate PRS, and systematically analyzed factors that could affect the power of PRS, including GWAS summary statistics, population stratification, and impact of variants. After leveraging a stepwise condition analysis, we eventually established the best PRS model, achieving an AUC of 0.713. Meanwhile, samples in the top 20% of the PRS distribution had a risk of UC more than ten times higher than samples in the lowest 20% (OR = 10.435, 95% CI 8.571–12.703). Our analyses demonstrated that including population-enriched, more disease-associated SNPs and using GWAS summary statistics from a similar ethnic background can improve the power of PRS. Strictly following the principle of focusing on one population in all aspects of generating PRS can be a cost-effective way to apply genotype-array-derived PRS to practical risk estimation.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)
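
As a rough illustration of the quantity described in the abstract above, a polygenic risk score can be sketched as a weighted sum of an individual's risk-allele dosages, with per-variant GWAS effect sizes as weights. This is a minimal sketch only, not the PRSice-2 procedure used in the paper; the function name and the toy dosages and effect sizes are hypothetical.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Return the weighted sum of risk-allele dosages for one individual.

    dosages: risk-allele counts (0, 1, or 2) per variant.
    effect_sizes: per-allele effect sizes (e.g., log odds ratios), one per variant.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Toy values, made up for illustration: one individual, four variants.
toy_dosages = [0, 1, 2, 0]
toy_effects = [0.10, -0.05, 0.20, 0.15]
toy_score = polygenic_risk_score(toy_dosages, toy_effects)
```

In practice, the choice of variants and weights (for example, clumping and p-value thresholding) matters far more than the summation itself.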

13 pages, 1424 KiB  
Article
Multiple Sclerosis Heritability Estimation on Sardinian Ascertained Extended Families Using Bayesian Liability Threshold Model
by Andrea Nova, Teresa Fazia, Valeria Saddi, Marialuisa Piras and Luisa Bernardinelli
Genes 2023, 14(8), 1579; https://doi.org/10.3390/genes14081579 - 2 Aug 2023
Cited by 1 | Viewed by 1449
Abstract
Heritability studies represent an important tool to investigate the main sources of variability for complex diseases, whose etiology involves both genetics and environmental factors. In this paper, we aimed to estimate multiple sclerosis (MS) narrow-sense heritability (h²), on a liability scale, using extended families ascertained from affected probands sampled in the Sardinian province of Nuoro, Italy. We also investigated the sources of MS liability variability among shared environment effects, sex, and categorized year of birth (<1946, ≥1946). The latter can be considered a proxy for different early environmental exposures. To this aim, we implemented a Bayesian liability threshold model to obtain posterior distributions for the parameters of interest adjusting for ascertainment bias. Our analysis highlighted categorized year of birth as the main explanatory factor, explaining ~70% of MS liability variability (median value = 0.69, 95% CI: 0.64, 0.73), while h² was close to 0% (median value = 0.03, 95% CI: 0.00, 0.09). By performing a year of birth-stratified analysis, we found a high h² only in individuals born in or after 1946 (median value = 0.82, 95% CI: 0.68, 0.93), meaning that the genetic variability acquired a high explanatory role only when focusing on this subpopulation. Overall, the results obtained highlighted early environmental exposures, in the Sardinian population, as a meaningful factor involved in MS to be further investigated.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

17 pages, 942 KiB  
Article
Exploring the Lifetime Effect of Children on Wellbeing Using Two-Sample Mendelian Randomisation
by Benjamin Woolf, Hannah M. Sallis and Marcus R. Munafò
Genes 2023, 14(3), 716; https://doi.org/10.3390/genes14030716 - 14 Mar 2023
Cited by 3 | Viewed by 2422
Abstract
Background: Observational research implies a negative effect of having children on wellbeing. Objectives: To provide Mendelian randomisation evidence of the effect of having children on parental wellbeing. Design: Two-sample Mendelian randomisation. Setting: Non-clinical European ancestry participants. Participants: We used the UK Biobank (460,654 male and female European ancestry participants) as a source of genotype-exposure associations, the Social Science Genetics Consortia (SSGAC) (298,420 male and female European ancestry participants), and the Within-Family Consortia (effective sample of 22,656 male and female European ancestry participants) as sources of genotype-outcome associations. Interventions: The lifetime effect of an increase in the genetic liability to having children. Primary and secondary outcome measures: The primary analysis was an inverse variance weighted analysis of subjective wellbeing measured in the 2016 SSGAC Genome Wide Association Study (GWAS). Secondary outcomes included pleiotropy robust estimators applied in the SSGAC and an analysis using the Within-Family consortia GWAS. Results: We did not find strong evidence of a negative (standard deviation) change in wellbeing (β = 0.153 [95% CI: −0.210 to 0.516]) per child parented. Secondary outcomes were generally slightly deflated (e.g., −0.049 [95% CI: −0.533 to 0.435] for the Within-Family Consortia and 0.090 [95% CI: −0.167 to 0.347] for weighted median), implying the presence of some residual confounding and pleiotropy. Conclusions: Contrary to the existing literature, our results are not compatible with a measurable negative effect of number of children on the average wellbeing of a parent over their life course. However, we were unable to explore non-linearities, interactions, or time-varying effects.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)
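
For readers unfamiliar with the primary analysis named in the abstract above, the fixed-effect inverse variance weighted (IVW) estimator used in two-sample Mendelian randomisation can be sketched as follows. This is a generic textbook form under the usual first-order approximation, not the authors' code; the function name and the toy summary statistics are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    Each variant contributes a Wald ratio (beta_outcome / beta_exposure),
    weighted by beta_exposure**2 / se_outcome**2.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per variant")
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one variant must have a non-zero exposure effect")
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    return numerator / total_weight, math.sqrt(1.0 / total_weight)


# Toy summary statistics, made up for illustration: two variants whose
# Wald ratios are both 0.2, so the pooled estimate is also 0.2.
toy_estimate, toy_se = ivw_estimate([0.1, 0.2], [0.02, 0.04], [0.01, 0.01])
```

Pleiotropy-robust estimators such as the weighted median, mentioned in the abstract as secondary analyses, replace this weighted mean of Wald ratios with other summaries of the same per-variant ratios.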

16 pages, 555 KiB  
Article
Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data
by Niloufar Dousti Mousavi, Jie Yang and Hani Aldirawi
Genes 2023, 14(2), 403; https://doi.org/10.3390/genes14020403 - 3 Feb 2023
Cited by 5 | Viewed by 1904
Abstract
Sparse data with a high proportion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We utilize two real scientific applications as illustrations: a longitudinal vaginal microbiome dataset and a high-dimensional gene expression dataset. We recommend zero-inflated model selections and significance tests to identify the time intervals when the pregnant and non-pregnant groups of women are significantly different in terms of Lactobacillus species. We apply the same techniques to select the best 50 genes out of 2426 in the sparse gene expression data. The classification based on our selected genes achieves 100% prediction accuracy. Furthermore, the first four principal components based on the selected genes can explain as much as 83% of the model variability.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

11 pages, 298 KiB  
Article
Generating Minimal Models of H1N1 NS1 Gene Sequences Using Alignment-Based and Alignment-Free Algorithms
by Meng Fang, Jiawei Xu, Nan Sun and Stephen S.-T. Yau
Genes 2023, 14(1), 186; https://doi.org/10.3390/genes14010186 - 10 Jan 2023
Cited by 1 | Viewed by 1377
Abstract
For virus classification and tracing, one idea is to generate minimal models from the gene sequences of each virus group for comparative analysis within and between classes, as well as classification and tracing of new sequences. The starting point of defining a minimal model for a group of gene sequences is to find their longest common sequence (LCS), but this is a non-deterministic polynomial-time hard (NP-hard) problem. Therefore, we applied some heuristic approaches to finding the LCS, as well as some of the newer methods of treating gene sequences, including multiple sequence alignment (MSA) and k-mer natural vector (NV) encoding. To evaluate our algorithms, a five-fold cross-validation classification scheme on a dataset of H1N1 virus non-structural protein 1 (NS1) gene sequences was analyzed. The results indicate that the MSA-based algorithm has the best performance measured by classification accuracy, while the NV-based algorithm exhibits advantages in the time complexity of generating minimal models.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

15 pages, 350 KiB  
Article
Clustering Gene Expressions Using the Table Invitation Prior
by Charles W. Harrison, Qing He and Hsin-Hsiung Huang
Genes 2022, 13(11), 2036; https://doi.org/10.3390/genes13112036 - 4 Nov 2022
Cited by 3 | Viewed by 1907
Abstract
A prior for Bayesian nonparametric clustering called the Table Invitation Prior (TIP) is used to cluster gene expression data. TIP uses information concerning the pairwise distances between subjects (e.g., gene expression samples) and automatically estimates the number of clusters. TIP’s hyperparameters are estimated using a univariate multiple change point detection algorithm with respect to the subject distances, and thus TIP does not require an analyst’s intervention for estimating hyperparameters. A Gibbs sampling algorithm is provided, and TIP is used in conjunction with a Normal-Inverse-Wishart likelihood to cluster 801 gene expression samples, each of which belongs to one of five different types of cancer.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

Review

18 pages, 644 KiB  
Review
Using Genetics to Investigate Relationships between Phenotypes: Application to Endometrial Cancer
by Kelsie Bouttle, Nathan Ingold and Tracy A. O’Mara
Genes 2024, 15(7), 939; https://doi.org/10.3390/genes15070939 - 18 Jul 2024
Viewed by 1285
Abstract
Genome-wide association studies (GWAS) have accelerated the exploration of genotype–phenotype associations, facilitating the discovery of replicable genetic markers associated with specific traits or complex diseases. This narrative review explores the statistical methodologies developed using GWAS data to investigate relationships between various phenotypes, focusing on endometrial cancer, the most prevalent gynecological malignancy in developed nations. Advancements in analytical techniques such as genetic correlation, colocalization, cross-trait locus identification, and causal inference analyses have enabled deeper exploration of associations between different phenotypes, enhancing statistical power to uncover novel genetic risk regions. These analyses have unveiled shared genetic associations between endometrial cancer and many phenotypes, enabling identification of novel endometrial cancer risk loci and furthering our understanding of risk factors and biological processes underlying this disease. The current status of research in endometrial cancer is robust; however, this review demonstrates that further opportunities exist in statistical genetics that hold promise for advancing the understanding of endometrial cancer and other complex diseases.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)

32 pages, 810 KiB  
Review
Computational Prediction of Protein Intrinsically Disordered Region Related Interactions and Functions
by Bingqing Han, Chongjiao Ren, Wenda Wang, Jiashan Li and Xinqi Gong
Genes 2023, 14(2), 432; https://doi.org/10.3390/genes14020432 - 8 Feb 2023
Cited by 5 | Viewed by 3941
Abstract
Intrinsically Disordered Proteins (IDPs) and Regions (IDRs) exist widely. Although they lack well-defined structures, they participate in many important biological processes. In addition, they are widely related to human diseases and have become potential targets in drug discovery. However, there is a large gap between the number of experimentally annotated IDPs/IDRs and their actual number. In recent decades, computational methods related to IDPs/IDRs have been developed vigorously, including methods for predicting IDPs/IDRs, the binding modes of IDPs/IDRs, the binding sites of IDPs/IDRs, and the molecular functions of IDPs/IDRs, depending on the task. In view of the correlation between these predictors, we have reviewed these prediction methods uniformly for the first time, summarized their computational methods and predictive performance, and discussed some problems and perspectives.
(This article belongs to the Special Issue Statistical Methods for Genetic Epidemiology)
