A Consensus Compound/Bioactivity Dataset for Data-Driven Drug Design and Chemogenomics
Abstract
1. Introduction
2. Results
2.1. Sources of Compound/Bioactivity Databases
- ChEMBL28 [1] containing experimentally determined bioactivity data for 2.1 million drug-like bioactive molecules;
- PubChem [2] (downloaded 11.01.21) containing chemical and physical properties, as well as biological activities for more than 110 million molecules;
- BindingDB [3] (downloaded 25.02.21) containing experimentally determined binding affinities to biological targets for approx. 26,000 drug-like bioactive molecules;
- IUPHAR/BPS [4] Guide to Pharmacology (version 2021.1) containing curated information on biological targets and bioactivity for selected, pharmacologically active tool compounds;
- Probes & Drugs [5] (version 02b_2021) containing bioactivity data as well as target and signaling pathway information for more than 30,000 compounds from 29 public and commercial libraries, with great attention to chemical probes and drugs.
The following numbers of molecules were selected from each source database:
- ChEMBL28: 1,131,947 molecules;
- PubChem: 444,152 molecules;
- BindingDB: 26,856 molecules;
- IUPHAR/BPS: 7371 molecules;
- Probes & Drugs: 34,211 molecules.
2.2. Analysis of Public Compound/Bioactivity Databases
2.2.1. Database Coverage and Overlap
2.2.2. Compound Space of Databases
2.2.3. Activity Analysis
- In ChEMBL, there were 6,575,449 annotated bioactivities covering 4081 targets for the 1,131,947 selected molecules, of which 23.3% were active, 19.8% weakly active, 43.5% inactive, 13.2% not specified, and 0.2% had no data point;
- In PubChem (for comparison/curation), there were 11,587,761 annotated bioactivities on 3199 targets for 444,152 selected molecules, of which 47.1% were active, 37.2% weakly active, 15.6% inactive, 0.01% not specified, and 0% had no data point;
- In BindingDB, there were 51,424 annotated bioactivities on 758 targets for 26,856 molecules, of which 52.7% were active, 23.3% weakly active, 22.2% inactive, 0% not specified, and 1.8% had no data point;
- In IUPHAR/BPS, there were 15,557 annotated bioactivities covering 1657 targets for 7371 molecules, of which 79.8% were active, 11.5% weakly active, 5% inactive, 3.3% not specified, and 0.5% had no data point;
- In Probes & Drugs, there were 930,209 bioactivities on 4042 targets for 34,211 molecules, of which 22.9% were active, 17% weakly active, 16.2% inactive, 30.4% not specified, and 13.5% had no data point.
2.3. Consensus Database Assembly
- Compound information: ChEMBL ID, PubChem ID, IUPHAR/BPS ID, and all unique names contained in the five source databases are listed. The molecular structure is expressed as canonical SMILES. The biological target is given using HGNC symbols and the corresponding target family;
- Experimentally determined bioactivity (including unit of measure, activity type, assay type), pivoted/grouped by the ligand, target, unit of measure, activity type, and assay type. For IUPHAR/BPS, BindingDB, and Probes & Drugs, the assay type is defined by the activity type (Ki, Kd = cell-free; IC50, EC50 = cell-based) and for ChEMBL and PubChem through keyword search in the assay description (cell-free: e.g., “binding affinity”, “recruitment”, “displacement”, “fret”, “htrf”; cell-based: e.g., “cell” in combination with “reporter” or “receptor”, “transfection”, “transactivation”, “luciferase”, “galactosidase”; functional: e.g., “cell proliferation”, “cell” without “reporter” or “receptor”; unspecified: any other). After pivoting, the dataset has multiple columns representing different bioactivities from the different databases and up to four rows for different assay types per compound/target pair. In addition, the number of activity values for the respective compound/target pair in the respective database is noted;
- Activity check annotation: For automated curation and to add confidence to annotated bioactivity data, bioactivity values with common activity type and assay type from different source databases were compared and flagged as follows: ‘no comment’: bioactivity values are within one log unit; ‘check activity data’: bioactivity values are not within one log unit; ‘only one data point’: only one value was available, so no comparison was made and no range was calculated; ‘no activity value’: no precise numeric activity value was available; ‘no log-value could be calculated’: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration;
- Structure check (Tanimoto): To denote matching or differing compound structures in different source databases, we added a structure check label as follows: ‘match’: molecule structures are the same between different sources; ‘no match’: the structures differ; ‘one structure’: no structure comparison was possible because only one structure was available; ‘no structure’: no structure comparison was possible because no structure was available. For structure annotations that did not match, we computed the Jaccard–Tanimoto similarity coefficient on Morgan fingerprints [24] for the set of structures reported for one compound by the different sources to reveal true structural differences. The minimum similarity value is presented in the final dataset together with the structure check.
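The keyword-based assay type assignment described in the list above can be illustrated with a minimal Python sketch. It is not the authors' KNIME workflow: the function names, the order in which the keyword groups are tested, and the fallback to "unspecified" for unlisted activity types are assumptions made for illustration, and only the example keywords quoted in the text are used.

```python
# Example keywords quoted in the text; the source lists them with "e.g.",
# so the real keyword sets may be larger.
CELL_FREE_KEYWORDS = ("binding affinity", "recruitment", "displacement", "fret", "htrf")
CELL_BASED_KEYWORDS = ("transfection", "transactivation", "luciferase", "galactosidase")

# Mapping used for IUPHAR/BPS, BindingDB, and Probes & Drugs records.
ACTIVITY_TYPE_TO_ASSAY_TYPE = {
    "Ki": "cell-free",
    "Kd": "cell-free",
    "IC50": "cell-based",
    "EC50": "cell-based",
}


def classify_assay_description(description: str) -> str:
    """Assign an assay type to a ChEMBL/PubChem assay description.

    Assumption: cell-free keywords are tested first, then cell-based,
    then functional; the source does not state a precedence.
    """
    text = description.lower()
    has_cell = "cell" in text
    if any(keyword in text for keyword in CELL_FREE_KEYWORDS):
        return "cell-free"
    if has_cell and ("reporter" in text or "receptor" in text):
        return "cell-based"
    if any(keyword in text for keyword in CELL_BASED_KEYWORDS):
        return "cell-based"
    if has_cell:  # covers "cell proliferation" and "cell" without reporter/receptor
        return "functional"
    return "unspecified"


def assay_type_from_activity_type(activity_type: str) -> str:
    """Assay type for sources where the activity type defines it."""
    # Assumption: activity types outside the four listed ones are "unspecified".
    return ACTIVITY_TYPE_TO_ASSAY_TYPE.get(activity_type, "unspecified")
```

Because the keyword groups can overlap in a single description (for example, a displacement assay run in cells), the precedence chosen here directly affects the result and would need to be aligned with the original workflow before reuse.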
2.4. Analysis of the Consensus Database
3. Discussion
4. Materials and Methods
4.1. Data Collection and Database Merging
- Downloading the databases and the additional compound identifier mapping information between the different databases (ChEMBL ID, PubChem CID, and UniProt ID) and loading all data into the Konstanz Information Miner (KNIME):
  - ChEMBL downloaded as PostgreSQL database;
  - PubChem downloaded as RDF files;
  - BindingDB downloaded as SDF file;
  - IUPHAR/BPS downloaded as CSV file;
  - Probes & Drugs downloaded as PostgreSQL database.
- Mapping the compounds to the compound identifier information;
- Only records for the organism Homo sapiens and for compounds with a molecular weight lower than 1500 were considered;
- Standardization of target names using HGNC (gene symbols);
- Calculation of the negative decadic logarithm of experimental readouts with a molar unit; experimental readouts with other units were left unchanged;
- Classification of records from ChEMBL and PubChem into cell-free, cell-based, and functional using keywords from the assay description (assay type); for IUPHAR/BPS, BindingDB, and Probes & Drugs, the activity type defined the assay type;
- Pivoting the data of each source database first by compound ID (ChEMBL ID, PubChem ID) and then pivoting data referring to one molecule by gene symbol, activity type, assay type, and unit. After that, the ranges of log-values were calculated:
  - range ≤ 1 -> calculate mean of log-values + frequency;
  - range > 1 -> split into several columns + frequency;
  - no activity value deposited -> search for information in the assay comments and write it into columns + frequency.
- Merging the different databases based on their compound ID (ChEMBL ID, PubChem ID), gene symbol, activity type, assay type, and unit;
- Activity check annotation between databases: Calculating the range between the highest and lowest value in the new dataset:
  - range ≤ 1 -> no comment;
  - range > 1 -> ‘check activity data’;
  - range not calculable -> ‘only 1 data point’;
  - no activity value deposited -> ‘no activity value’;
  - no log-value -> ‘no log-value could be calculated. Please check the matches by yourself’.
- Salts were removed from the molecular structures and the remaining structures were converted into canonical SMILES strings. Lastly, the molecules were screened for tautomers and the canonical tautomer was generated.
- Structure check: The structure check was performed on the standardized SMILES by string comparison. Only for entries that did not match were the structures additionally compared using the Jaccard–Tanimoto coefficient computed on Morgan fingerprints, and the minimum value was noted.
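The structure check in the last step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the published workflow: fingerprints are represented as sets of on-bit identifiers, the `fingerprint` argument is a hypothetical callable that in practice would return Morgan fingerprint bits from a cheminformatics toolkit, and `toy_fingerprint` is a placeholder based on character bigrams that is not chemically meaningful.

```python
from itertools import combinations


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Jaccard-Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(bits_a & bits_b) / len(union)


def structure_check(smiles_by_source: dict, fingerprint) -> tuple:
    """Return (label, minimum similarity or None) for one compound.

    smiles_by_source maps a source name to a standardized SMILES string,
    or to None if that source provides no structure.
    """
    structures = [s for s in smiles_by_source.values() if s]
    if not structures:
        return ("no structure", None)
    if len(structures) == 1:
        return ("one structure", None)
    unique = sorted(set(structures))  # string comparison on standardized SMILES
    if len(unique) == 1:
        return ("match", None)
    # Only mismatching entries are compared by fingerprint similarity.
    fingerprints = [fingerprint(s) for s in unique]
    min_similarity = min(tanimoto(a, b) for a, b in combinations(fingerprints, 2))
    return ("no match", min_similarity)


def toy_fingerprint(smiles: str) -> set:
    """Placeholder only: character bigrams, NOT a Morgan fingerprint."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}
```

For example, `structure_check({"ChEMBL": "CCO", "PubChem": "CCN"}, toy_fingerprint)` yields the label `no match` together with the smallest pairwise similarity, mirroring how the minimum value is reported next to the structure check label.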
4.2. Software and Code
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Sample Availability
Abbreviations
HGNC | HUGO Gene Nomenclature Committee at the European Bioinformatics Institute |
KNIME | Konstanz Information Miner |
SMILES | Simplified Molecular Input Line Entry System |
UMAP | uniform manifold approximation and projection for dimension reduction |
References
1. Mendez, D.; Gaulton, A.; Bento, A.P.; Chambers, J.; de Veij, M.; Félix, E.; Magariños, M.P.; Mosquera, J.F.; Mutowo, P.; Nowotka, M.; et al. ChEMBL: Towards Direct Deposition of Bioassay Data. Nucleic Acids Res. 2019, 47, D930–D940.
2. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem in 2021: New Data Content and Improved Web Interfaces. Nucleic Acids Res. 2021, 49, D1388–D1395.
3. Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A Public Database for Medicinal Chemistry, Computational Chemistry and Systems Pharmacology. Nucleic Acids Res. 2016, 44, D1045–D1053.
4. Harding, S.D.; Armstrong, J.F.; Faccenda, E.; Southan, C.; Alexander, S.P.H.; Davenport, A.P.; Pawson, A.J.; Spedding, M.; Davies, J.A. The IUPHAR/BPS Guide to PHARMACOLOGY in 2022: Curating Pharmacology for COVID-19, Malaria and Antibacterials. Nucleic Acids Res. 2022, 50, D1282–D1294.
5. Škuta, C.; Southan, C.; Bartůněk, P. Will the Chemical Probes Please Stand Up? RSC Med. Chem. 2021, 12, 1428–1441.
6. Wishart, D.S.; Feunang, Y.D.; Guo, A.C.; Lo, E.J.; Marcu, A.; Grant, J.R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z.; et al. DrugBank 5.0: A Major Update to the DrugBank Database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082.
7. Wassermann, A.M.; Bajorath, J. BindingDB and ChEMBL: Online Compound Databases for Drug Discovery. Expert Opin. Drug Discov. 2011, 6, 683–687.
8. Merk, D.; Friedrich, L.; Grisoni, F.; Schneider, G. De Novo Design of Bioactive Small Molecules by Artificial Intelligence. Mol. Inform. 2018, 37, 1700153.
9. Moret, M.; Helmstädter, M.; Grisoni, F.; Schneider, G.; Merk, D. De Novo Design Beam Search for Automated Design and Scoring of Novel ROR Ligands with Machine Intelligence. Angew. Chem. Int. Ed. 2021, 60, 19477–19482.
10. Moret, M.; Friedrich, L.; Grisoni, F.; Merk, D.; Schneider, G. Generative Molecular Design in Low Data Regimes. Nat. Mach. Intell. 2020, 2, 171–180.
11. Merk, D.; Grisoni, F.; Friedrich, L.; Schneider, G. Tuning Artificial Intelligence on the de Novo Design of Natural-Product-Inspired Retinoid X Receptor Modulators. Commun. Chem. 2018, 1, 68.
12. Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inform. 2010, 29, 476–488.
13. Griffen, E.J.; Dossetter, A.G.; Leach, A.G.; Montague, S. Can We Accelerate Medicinal Chemistry by Augmenting the Chemist with Big Data and Artificial Intelligence? Drug Discov. Today 2018, 23, 1373–1384.
14. Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. Are the Chemical Structures in Your QSAR Correct? QSAR Comb. Sci. 2008, 27, 1337–1345.
15. Fourches, D.; Muratov, E.; Tropsha, A. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Modeling 2010, 50, 1189–1204.
16. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
17. Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893.
18. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Modeling 2010, 50, 742–754.
19. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426v3.
20. Valsecchi, C.; Grisoni, F.; Motta, S.; Bonati, L.; Ballabio, D. NURA: A Curated Dataset of Nuclear Receptor Modulators. Toxicol. Appl. Pharmacol. 2020, 407, 115244.
21. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007); Springer: Berlin/Heidelberg, Germany, 2007.
22. Tweedie, S.; Braschi, B.; Gray, K.; Jones, T.E.M.; Seal, R.L.; Yates, B.; Bruford, E.A. Genenames.Org: The HGNC and VGNC Resources in 2021. Nucleic Acids Res. 2021, 49, D939–D946.
23. Jin, H.; Moseley, H.N.B. Hierarchical Harmonization of Atom-Resolved Metabolic Reactions across Metabolic Databases. Metabolites 2021, 11, 431.
24. Morgan, H. The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113.
Category | Total Number | Percentage (%) |
---|---|---|
Exact match of the activity values | 987,022 | 73 |
Match within one log change | 45,831 | 3.4 |
Outside one log change (no match) | 51,912 | 3.8 |
No activity value | 192,951 | 14.3 |
No log value could be calculated | 74,898 | 5.5 |
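The categories in the table above stem from the activity check annotation described in Section 4.1. As a minimal sketch of those rules, the following Python code converts molar readouts to their negative decadic logarithm and assigns the flag for one compound/target/activity type/assay type group. The function names, the unit table, and the order in which the conditions are tested are assumptions for illustration; the sketch follows the flags of Section 4.1 and does not distinguish exact matches from matches within one log unit as the table does.

```python
import math

# Assumption: molar units that may occur in the data and their factors to mol/L.
MOLAR_FACTORS = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def negative_log10(value, unit):
    """Negative decadic logarithm of a concentration, or None if not computable."""
    if value is None or value <= 0 or unit not in MOLAR_FACTORS:
        return None
    return -math.log10(value * MOLAR_FACTORS[unit])


def activity_check(records):
    """Flag one group of (value, unit) records collected across source databases."""
    with_value = [(value, unit) for value, unit in records if value is not None]
    if not with_value:
        return "no activity value"
    log_values = [negative_log10(value, unit) for value, unit in with_value]
    if any(log_value is None for log_value in log_values):
        return "no log-value could be calculated"
    if len(log_values) == 1:
        return "only 1 data point"
    value_range = max(log_values) - min(log_values)
    return "no comment" if value_range <= 1 else "check activity data"
```

For instance, readouts of 10 nM and 50 nM differ by about 0.7 log units and receive no comment, whereas 10 nM and 5 uM differ by about 2.7 log units and are flagged for checking.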
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Isigkeit, L.; Chaikuad, A.; Merk, D. A Consensus Compound/Bioactivity Dataset for Data-Driven Drug Design and Chemogenomics. Molecules 2022, 27, 2513. https://doi.org/10.3390/molecules27082513