Biomolecular Data Science—in Honor of Professor Philip E. Bourne

A special issue of Biomolecules (ISSN 2218-273X). This special issue belongs to the section "Bioinformatics and Systems Biology".

Deadline for manuscript submissions: closed (31 December 2022) | Viewed by 48060

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Dr. Cameron Mura
Guest Editor
Department of Biomedical Engineering, School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
Interests: computational biology; structural biology; molecular simulations; molecular evolution; machine learning

Prof. Dr. Lei Xie
Guest Editor
1. Department of Computer Science, Hunter College & The Graduate Center, The City University of New York, New York, NY 10065, USA
2. Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, NY 10021, USA
Interests: bioinformatics; machine learning; biomedical data; computational drug design; systems biology

Special Issue Information

Dear Colleagues,

Prof. Philip E. Bourne (‘Phil’), the founding Dean of the School of Data Science at the University of Virginia, has spent his career exploring and helping to define the intersection of biomolecules and computation: as a practicing scientist, as an academic leader, and in conjunction with industry and government. In the 40 years of that career, our knowledge of biomolecular structure, function and evolution (in both health and disease) has advanced rapidly, often with exponential growth. Those advances were enabled, in no small part, by Phil’s highly collaborative and foundational work, in which two pervasive themes have been (i) the key role of three-dimensional structure as a bridge between a biomolecule’s sequence and its function, and (ii) computational methodologies and resources, including the development of state-of-the-art databases (notably the RCSB Protein Data Bank) and associated data formats (e.g., the macromolecular crystallographic information file), the creation of standards and interoperable tools, and the development of algorithms such as the widely used combinatorial extension (CE) method for structure alignment. Alongside these foundational, ‘basic research’ advances, Phil’s work and its applications have significantly impacted a vast array of biological domains, including early-stage drug discovery, molecular evolution, immunology and more, resulting in over 300 papers and two related books. All the while, Phil has been unwavering in his support of public service in government and academia, of open scholarship and research best practices, and of the professional development of all who have crossed his path, at all levels (students, peers, colleagues).

Anyone who has worked with Phil has seen that a notable trait of his approach to the biosciences (and now data science) is that it is expansive and forward-looking; in a word, ‘visionary’. Phil’s focus in recent years, as it relates to this readership, has turned to “biomedical data sciences”, which can be viewed as the natural evolution (and synthesis) of bioinformatics, computational biology, systems biology, and other allied fields. This Special Issue honors Phil by trying to capture his vision as it relates to biomolecules: how this vision arose and what it can encompass. We invite reviews and original research papers that share the spirit of that vision.

That vision can be expressed as four elements of data science: systems, design, analysis and value. Biomolecular systems, in a computational sense, relate to underlying infrastructure such as data structures, ontologies, software libraries/tools, etc., that enable discovery. Biomolecular analysis, of late, is dominated by machine learning approaches such as deep learning (for which systems to access training data are critical). In our data science context, design can refer to human–computer interaction, for example, where biomolecular visualization plays a vital role. Lastly, value relates to maximizing the benefit that research has on those it is intended to serve. This issue’s papers exemplify what a field of “biomolecular data sciences” can represent, as a fitting tribute to someone who has endeavored to move the field forward—both via his own work and by his steadfast support of the work of others in the field more broadly.

Dr. Cameron Mura
Prof. Dr. Lei Xie
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Biomolecules is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2700 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data science;
  • deep learning;
  • machine learning;
  • structural bioinformatics;
  • biophysics;
  • systems biology

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (14 papers)

Editorial

5 pages, 720 KiB  
Editorial
A Tribute to Phil Bourne—Scientist and Human
by Cameron Mura, Emma Candelier and Lei Xie
Biomolecules 2023, 13(1), 181; https://doi.org/10.3390/biom13010181 - 16 Jan 2023
Viewed by 2387
Abstract
This Special Issue of Biomolecules[...] Full article
(This article belongs to the Special Issue Biomolecular Data Science—in Honor of Professor Philip E. Bourne)

Research

18 pages, 2761 KiB  
Article
DL-TODA: A Deep Learning Tool for Omics Data Analysis
by Cecile M. Cres, Andrew Tritt, Kristofer E. Bouchard and Ying Zhang
Biomolecules 2023, 13(4), 585; https://doi.org/10.3390/biom13040585 - 24 Mar 2023
Cited by 3 | Viewed by 3169
Abstract
Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the multiplication of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied for the modeling of species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than 0.93 by Kraken2 and 0.85 by Centrifuge on the same test set. Application of DL-TODA to the human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and is less biased toward a single taxon. Full article

23 pages, 2110 KiB  
Article
Using GPT-3 to Build a Lexicon of Drugs of Abuse Synonyms for Social Media Pharmacovigilance
by Kristy A. Carpenter and Russ B. Altman
Biomolecules 2023, 13(2), 387; https://doi.org/10.3390/biom13020387 - 18 Feb 2023
Cited by 9 | Viewed by 5052
Abstract
Drug abuse is a serious problem in the United States, with over 90,000 drug overdose deaths nationally in 2020. A key step in combating drug abuse is detecting, monitoring, and characterizing its trends over time and location, also known as pharmacovigilance. While federal reporting systems accomplish this to a degree, they often have high latency and incomplete coverage. Social-media-based pharmacovigilance has zero latency, is easily accessible and unfiltered, and benefits from drug users being willing to share their experiences online pseudo-anonymously. However, unlike highly structured official data sources, social media text is rife with misspellings and slang, making automated analysis difficult. Generative Pretrained Transformer 3 (GPT-3) is a large autoregressive language model specialized for few-shot learning that was trained on text from the entire internet. We demonstrate that GPT-3 can be used to generate slang and common misspellings of terms for drugs of abuse. We repeatedly queried GPT-3 for synonyms of drugs of abuse and filtered the generated terms using automated Google searches and cross-references to known drug names. When generated terms for alprazolam were manually labeled, we found that our method produced 269 synonyms for alprazolam, 221 of which were new discoveries not included in an existing drug lexicon for social media. We repeated this process for 98 drugs of abuse, of which 22 are widely-discussed drugs of abuse, building a lexicon of colloquial drug synonyms that can be used for pharmacovigilance on social media. Full article

20 pages, 4489 KiB  
Article
KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units
by Tolulope Adeyelu, Nicola Bordin, Vaishali P. Waman, Marta Sadlej, Ian Sillitoe, Aurelio A. Moya-Garcia and Christine A. Orengo
Biomolecules 2023, 13(2), 277; https://doi.org/10.3390/biom13020277 - 2 Feb 2023
Cited by 5 | Viewed by 3535
Abstract
Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity. Full article

8 pages, 1296 KiB  
Article
Entropy and Variability: A Second Opinion by Deep Learning
by Daniel T. Rademaker, Li C. Xue, Peter A. C. ‘t Hoen and Gert Vriend
Biomolecules 2022, 12(12), 1740; https://doi.org/10.3390/biom12121740 - 23 Nov 2022
Cited by 2 | Viewed by 1596
Abstract
Background: Analysis of the distribution of amino acid types found at equivalent positions in multiple sequence alignments has found applications in human genetics, protein engineering, drug design, protein structure prediction, and many other fields. These analyses tend to revolve around measures of the distribution of the twenty amino acid types found at evolutionary equivalent positions: the columns in multiple sequence alignments. Commonly used measures are variability, average hydrophobicity, or Shannon entropy. One of these techniques, called entropy–variability analysis, as the name already suggests, reduces the distribution of observed residue types in one column to two numbers: the Shannon entropy and the variability as defined by the number of residue types observed. Results: We applied a deep learning, unsupervised feature extraction method to analyse the multiple sequence alignments of all human proteins. An auto-encoder neural architecture was trained on 27,835 multiple sequence alignments for human proteins to obtain the two features that best describe the seven million variability patterns. These two unsupervised learned features strongly resemble entropy and variability, indicating that these are the projections that retain most information when reducing the dimensionality of the information hidden in columns in multiple sequence alignments. Full article
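As a rough illustration of the two per-column measures named in this abstract, the sketch below computes the Shannon entropy and the variability (number of distinct residue types) for each column of a toy alignment. It is not the authors' code: the function names, the toy sequences, the natural-log base, and the choice to ignore gaps and non-standard symbols are all assumptions made for the example.

```python
import math
from collections import Counter

# The twenty standard amino acid one-letter codes.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def entropy_variability(column, base=math.e):
    """Return (Shannon entropy, variability) for one alignment column.

    Assumption: gaps and any symbol outside the twenty standard
    residues are ignored rather than counted as an extra state.
    """
    residues = [aa for aa in column.upper() if aa in AMINO_ACIDS]
    if not residues:
        return 0.0, 0
    counts = Counter(residues)
    total = len(residues)
    # H = sum_i p_i * log(1 / p_i), with p_i = n_i / total.
    entropy = sum((n / total) * math.log(total / n, base) for n in counts.values())
    # Variability = number of distinct residue types observed.
    return entropy, len(counts)


def per_column(alignment):
    """Apply entropy_variability to every column of equal-length sequences."""
    return [entropy_variability("".join(col)) for col in zip(*alignment)]


# Hypothetical toy alignment (four sequences, five columns).
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
for position, (h, v) in enumerate(per_column(toy_alignment), start=1):
    print(f"column {position}: entropy={h:.3f}, variability={v}")
```

A fully conserved column yields an entropy of zero and a variability of one; both numbers grow as more residue types appear and as their frequencies even out.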

18 pages, 20922 KiB  
Article
HBcompare: Classifying Ligand Binding Preferences with Hydrogen Bond Topology
by Justin Z. Tam, Zhaoming Kong, Omar Ahmed, Lifang He and Brian Y. Chen
Biomolecules 2022, 12(11), 1589; https://doi.org/10.3390/biom12111589 - 28 Oct 2022
Viewed by 1330
Abstract
This paper presents HBcompare, a method that classifies protein structures according to ligand binding preference categories by analyzing hydrogen bond topology. HBcompare excludes other characteristics of protein structure so that, in the event of accurate classification, it can implicate the involvement of hydrogen bonds in selective binding. This approach contrasts from methods that represent many aspects of protein structure because holistic representations cannot associate classification with just one characteristic. To our knowledge, HBcompare is the first technique with this capability. On five datasets of proteins that catalyze similar reactions with different preferred ligands, HBcompare correctly categorized proteins with similar ligand binding preferences 89.5% of the time. Using only hydrogen bond topology, classification accuracy with HBcompare surpassed standard structure-based comparison algorithms that use atomic coordinates. As a tool for implicating the role of hydrogen bonds in protein function categories, HBcompare represents a first step towards the automatic explanation of biochemical mechanisms. Full article

18 pages, 1973 KiB  
Article
Generalization Performance of Quantum Metric Learning Classifiers
by Jonathan Kim and Stefan Bekiranov
Biomolecules 2022, 12(11), 1576; https://doi.org/10.3390/biom12111576 - 27 Oct 2022
Cited by 2 | Viewed by 2674
Abstract
Quantum computing holds great promise for a number of fields including biology and medicine. A major application in which quantum computers could yield advantage is machine learning, especially kernel-based approaches. A recent method termed quantum metric learning, in which a quantum embedding which maximally separates data into classes is learned, was able to perfectly separate ant and bee image training data. The separation is achieved with an intrinsically quantum objective function and the overall approach was shown to work naturally as a hybrid classical-quantum computation enabling embedding of high dimensional feature data into a small number of qubits. However, the ability of the trained classifier to predict test sample data was never assessed. We assessed the performance of quantum metric learning on test ants and bees image data as well as breast cancer clinical data. We applied the original approach as well as variants in which we performed principal component analysis (PCA) on the feature data to reduce its dimensionality for quantum embedding, thereby limiting the number of model parameters. If the degree of dimensionality reduction was limited and the number of model parameters was constrained to be far less than the number of training samples, we found that quantum metric learning was able to accurately classify test data. Full article

18 pages, 1860 KiB  
Article
The Pharmacorank Search Tool for the Retrieval of Prioritized Protein Drug Targets and Drug Repositioning Candidates According to Selected Diseases
by Sergey Gnilopyat, Paul J. DePietro, Thomas K. Parry and William A. McLaughlin
Biomolecules 2022, 12(11), 1559; https://doi.org/10.3390/biom12111559 - 26 Oct 2022
Cited by 1 | Viewed by 3564
Abstract
We present the Pharmacorank search tool as an objective means to obtain prioritized protein drug targets and their associated medications according to user-selected diseases. This tool could be used to obtain prioritized protein targets for the creation of novel medications or to predict novel indications for medications that already exist. To prioritize the proteins associated with each disease, a gene similarity profiling method based on protein functions is implemented. The priority scores of the proteins are found to correlate well with the likelihoods that the associated medications are clinically relevant in the disease’s treatment. When the protein priority scores are plotted against the percentage of protein targets that are known to bind medications currently indicated to treat the disease, which we termed the pertinency score, a strong correlation was observed. The correlation coefficient was found to be 0.9978 when using a weighted second-order polynomial fit. As the highly predictive fit was made using a broad range of diseases, we were able to identify a general threshold for the pertinency score as a starting point for considering drug repositioning candidates. Several repositioning candidates are described for proteins that have high predicated pertinency scores, and these provide illustrative examples of the applications of the tool. We also describe focused reviews of repositioning candidates for Alzheimer’s disease. Via the tool’s URL, https://protein.som.geisinger.edu/Pharmacorank/, an open online interface is provided for interactive use; and there is a site for programmatic access. Full article

13 pages, 834 KiB  
Article
Mutational Signatures as Sensors of Environmental Exposures: Analysis of Smoking-Induced Lung Tissue Remodeling
by Yoo-Ah Kim, Ermin Hodzic, Bayarbaatar Amgalan, Ariella Saslafsky, Damian Wojtowicz and Teresa M. Przytycka
Biomolecules 2022, 12(10), 1384; https://doi.org/10.3390/biom12101384 - 27 Sep 2022
Cited by 3 | Viewed by 2604
Abstract
Smoking is a widely recognized risk factor in the emergence of cancers and other lung diseases. Studies of non-cancer lung diseases typically investigate the role that smoking has in chronic changes in lungs that might predispose patients to the diseases, whereas most cancer studies focus on the mutagenic properties of smoking. Large-scale cancer analysis efforts have collected expression data from both tumor and control lung tissues, and studies have used control samples to estimate the impact of smoking on gene expression. However, such analyses may be confounded by tumor-related micro-environments as well as patient-specific exposure to smoking. Thus, in this paper, we explore the utilization of mutational signatures to study environment-induced changes of gene expression in control lung tissues from lung adenocarcinoma samples. We show that a joint computational analysis of mutational signatures derived from sequenced tumor samples, and the gene expression obtained from control samples, can shed light on the combined impact that smoking and tumor-related micro-environments have on gene expression and cell-type composition in non-neoplastic (control) lung tissue. The results obtained through such analysis are both supported by experimental studies, including studies utilizing single-cell technology, and also suggest additional novel insights. We argue that the study provides a proof of principle of the utility of mutational signatures to be used as sensors of environmental exposures not only in the context of the mutational landscape of cancer, but also as a reference for changes in non-cancer lung tissues. It also provides an example of how a database collected with the purpose of understanding cancer can provide valuable information for studies not directly related to the disease. Full article

15 pages, 626 KiB  
Article
RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction
by Chaochao Yan, Peilin Zhao, Chan Lu, Yang Yu and Junzhou Huang
Biomolecules 2022, 12(9), 1325; https://doi.org/10.3390/biom12091325 - 19 Sep 2022
Cited by 11 | Viewed by 2626
Abstract
The main target of retrosynthesis is to recursively decompose desired molecules into available building blocks. Existing template-based retrosynthesis methods follow a template selection stereotype and suffer from limited training templates, which prevents them from discovering novel reactions. To overcome this limitation, we propose an innovative retrosynthesis prediction framework that can compose novel templates beyond training templates. As far as we know, this is the first method that uses machine learning to compose reaction templates for retrosynthesis prediction. Besides, we propose an effective reactant candidate scoring model that can capture atom-level transformations, which helps our method outperform previous methods on the USPTO-50K dataset. Experimental results show that our method can produce novel templates for 15 USPTO-50K test reactions that are not covered by training templates. We have released our source implementation. Full article

15 pages, 3128 KiB  
Article
A Machine Learning Approach to Identify the Importance of Novel Features for CRISPR/Cas9 Activity Prediction
by Dhvani Sandip Vora, Yugesh Verma and Durai Sundar
Biomolecules 2022, 12(8), 1123; https://doi.org/10.3390/biom12081123 - 16 Aug 2022
Cited by 4 | Viewed by 2592
Abstract
The reprogrammable CRISPR/Cas9 genome editing tool’s growing popularity is hindered by unwanted off-target effects. Efforts have been directed toward designing efficient guide RNAs as well as identifying potential off-target threats, yet factors that determine efficiency and off-target activity remain obscure. Based on sequence features, previous machine learning models performed poorly on new datasets, thus there is a need for the incorporation of novel features. The binding energy estimation of the gRNA-DNA hybrid as well as the Cas9-gRNA-DNA hybrid allowed generating better performing machine learning models for the prediction of Cas9 activity. The analysis of feature contribution towards the model output on a limited dataset indicated that energy features played a determining role along with the sequence features. The binding energy features proved essential for the prediction of on-target activity and off-target sites. The plateau, in the performance on unseen datasets, of current machine learning models could be overcome by incorporating novel features, such as binding energy, among others. The models are provided on GitHub (GitHub Inc., San Francisco, CA, USA). Full article

17 pages, 2729 KiB  
Article
GraphSite: Ligand Binding Site Classification with Deep Graph Learning
by Wentao Shi, Manali Singha, Limeng Pu, Gopal Srivastava, Jagannathan Ramanujam and Michal Brylinski
Biomolecules 2022, 12(8), 1053; https://doi.org/10.3390/biom12081053 - 29 Jul 2022
Cited by 7 | Viewed by 4253
Abstract
The binding of small organic molecules to protein targets is fundamental to a wide array of cellular functions. It is also routinely exploited to develop new therapeutic strategies against a variety of diseases. On that account, the ability to effectively detect and classify ligand binding sites in proteins is of paramount importance to modern structure-based drug discovery. These complex and non-trivial tasks require sophisticated algorithms from the field of artificial intelligence to achieve a high prediction accuracy. In this communication, we describe GraphSite, a deep learning-based method utilizing a graph representation of local protein structures and a state-of-the-art graph neural network to classify ligand binding sites. Using neural weighted message passing layers to effectively capture the structural, physicochemical, and evolutionary characteristics of binding pockets mitigates model overfitting and improves the classification accuracy. Indeed, comprehensive cross-validation benchmarks against a large dataset of binding pockets belonging to 14 diverse functional classes demonstrate that GraphSite yields the class-weighted F1-score of 81.7%, outperforming other approaches such as molecular docking and binding site matching. Further, it also generalizes well to unseen data with the F1-score of 70.7%, which is the expected performance in real-world applications. We also discuss new directions to improve and extend GraphSite in the future. Full article
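For readers unfamiliar with the metric quoted in this abstract, the sketch below shows one common way a class-weighted F1-score is computed: the per-class F1 is averaged with weights proportional to each class's share of the true labels. This is an assumption about the definition, not a statement of how GraphSite was evaluated; the function name and the labels are hypothetical.

```python
from collections import Counter


def class_weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Assumption: each class is weighted by its fraction of the true
    labels; a class that is never predicted gets precision 0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: items of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical labels for four binding pockets in two classes.
truth = ["class_a", "class_a", "class_a", "class_b"]
predicted = ["class_a", "class_a", "class_b", "class_b"]
print(f"class-weighted F1 = {class_weighted_f1(truth, predicted):.3f}")
```

Weighting by support keeps a rare class from dominating the average, which matters when the functional classes differ widely in size.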

Review

27 pages, 4479 KiB  
Review
Protein Data Bank: A Comprehensive Review of 3D Structure Holdings and Worldwide Utilization by Researchers, Educators, and Students
by Stephen K. Burley, Helen M. Berman, Jose M. Duarte, Zukang Feng, Justin W. Flatt, Brian P. Hudson, Robert Lowe, Ezra Peisach, Dennis W. Piehl, Yana Rose, Andrej Sali, Monica Sekharan, Chenghua Shao, Brinda Vallat, Maria Voigt, John D. Westbrook, Jasmine Y. Young and Christine Zardecki
Biomolecules 2022, 12(10), 1425; https://doi.org/10.3390/biom12101425 - 4 Oct 2022
Cited by 41 | Viewed by 6845
Abstract
The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), funded by the United States National Science Foundation, National Institutes of Health, and Department of Energy, supports structural biologists and Protein Data Bank (PDB) data users around the world. The RCSB PDB, a founding member of the Worldwide Protein Data Bank (wwPDB) partnership, serves as the US data center for the global PDB archive housing experimentally determined three-dimensional (3D) structure data for biological macromolecules. As the wwPDB-designated Archive Keeper, RCSB PDB is also responsible for the security of PDB data and the weekly update of the archive. Each year, RCSB PDB serves tens of thousands of data depositors (using macromolecular crystallography, nuclear magnetic resonance spectroscopy, electron microscopy, and micro-electron diffraction) working on all permanently inhabited continents. RCSB PDB makes PDB data available from its research-focused web portal at no charge and without usage restrictions to many millions of PDB data consumers around the globe. It also provides educators, students, and the general public with an introduction to the PDB and related training materials through its outreach and education-focused web portal. This review article describes the growth of the PDB, examines the evolution of experimental methods for structure determination viewed through the lens of the PDB archive, and provides a detailed accounting of PDB archival holdings and their utilization by researchers, educators, and students worldwide.

Other

7 pages, 530 KiB  
Perspective
From Genes to Geography, from Cells to Community, from Biomolecules to Behaviors: The Importance of Social Determinants of Health
by Jaysón Davidson, Rohit Vashisht and Atul J. Butte
Biomolecules 2022, 12(10), 1449; https://doi.org/10.3390/biom12101449 - 9 Oct 2022
Cited by 5 | Viewed by 3017
Abstract
Much scientific work over the past few decades has linked health outcomes and disease risk to genomics, to derive a better understanding of disease mechanisms at the genetic and molecular level. However, genomics alone does not capture the full picture of one’s overall health. Modern computational biomedical research is moving toward including social and environmental factors that ultimately affect quality of life and health outcomes at both the population and individual level. The future of studying disease now lies with the social determinants of health (SDOH), whose integration into electronic health records (EHRs) can help answer pressing clinical questions and address healthcare disparities across population groups. In this perspective article, we argue that the SDOH are the future of disease risk and health outcomes studies due to their broad coverage of a patient’s overall health. SDOH data availability in EHRs has improved tremendously over the years with EHR toolkits, diagnosis codes, wearable devices, and census tract information to study disease risk. We discuss the availability of SDOH data, challenges in SDOH implementation, its future in real-world evidence studies, and the next steps to report study outcomes in an equitable and actionable way.