Prediction of Proteins in Cerebrospinal Fluid and Application to Glioma Biomarker Identification
Abstract
1. Introduction
2. Results
2.1. Result of the Two-Stage Feature Selection
2.2. Comparison with other Prediction Methods
2.3. Application to Glioma Biomarker Identification
3. Materials and Methods
3.1. Data Collection
3.1.1. The CSF Protein Data
3.1.2. The Glioma Gene Expression Data
3.2. Prediction of Proteins in CSF
3.2.1. Feature Construction
3.2.2. Feature Selection
3.2.3. Protein Classification
3.3. Identification of Differentially Expressed Genes
3.4. Evaluation
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Sample Availability
Abbreviations
CSF | Cerebrospinal fluid |
CNS | Central nervous system |
SVM | Support vector machine |
DNN | Deep neural network |
PU | Positive unlabeled |
FDR | False discovery rate |
RFE | Recursive feature elimination |
FC | Fold change |
MCC | Matthews correlation coefficient |
AUC | Area under the ROC curve |
Method | Accuracy (ACC) | Precision (PR) | Recall (RE) | F1 | MCC | AUC |
---|---|---|---|---|---|---|
SVM | 0.6158 | 0.9003 | 0.2605 | 0.4040 | 0.3293 | 0.7891 |
DT | 0.6140 | 0.6923 | 0.4102 | 0.5152 | 0.2496 | 0.6140 |
DNN | 0.6726 | 0.8367 | 0.4288 | 0.5670 | 0.3953 | 0.7697 |
Our method | 0.7260 | 0.7229 | 0.7330 | 0.7279 | 0.4521 | 0.8041 |
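
The metric columns above are linked by standard definitions, and a short sketch makes those links explicit. The block below is illustrative only and is not the authors' evaluation code: the function names `classification_metrics` and `f1_from` are hypothetical, and PR and RE are assumed to mean precision and recall. It computes ACC, PR, RE, F1, and MCC from confusion-matrix counts, then checks that each reported F1 equals the harmonic mean of the reported PR and RE.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return ACC, PR, RE, F1 and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pr = tp / (tp + fp) if (tp + fp) else 0.0
    re = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pr * re / (pr + re) if (pr + re) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PR": pr, "RE": re, "F1": f1, "MCC": mcc}

def f1_from(pr, re):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * pr * re / (pr + re)

# (PR, RE, F1) copied from the table above.
REPORTED = {
    "SVM": (0.9003, 0.2605, 0.4040),
    "DT": (0.6923, 0.4102, 0.5152),
    "DNN": (0.8367, 0.4288, 0.5670),
    "Our method": (0.7229, 0.7330, 0.7279),
}

# Absolute gap between recomputed and reported F1 for each method.
F1_GAP = {
    name: abs(f1_from(pr, re) - f1)
    for name, (pr, re, f1) in REPORTED.items()
}

if __name__ == "__main__":
    for name, gap in F1_GAP.items():
        print(f"{name}: |recomputed F1 - reported F1| = {gap:.5f}")
```

Read this way, the table shows that SVM and DNN reach high precision at low recall, whereas the proposed method balances the two, which is what lifts its F1 and MCC.
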
Id | Accession | FC | Probability | q-Value | Type |
---|---|---|---|---|---|
1 | Q9H4X1 | 3.19 | 92.66% | 0.0218 | up |
2 | P28370 | 3.18 | 92.15% | 0.0007 | up |
3 | P49368 | 2.38 | 93.18% | 0.0073 | up |
4 | O14497 | 2.29 | 90.60% | 0.0094 | up |
5 | Q9P2E5 | 2.21 | 90.93% | 0.0065 | up |
6 | Q9UPP2 | 0.14 | 90.21% | 0.0105 | down |
7 | Q12955 | 0.14 | 96.78% | 0.0053 | down |
8 | Q14643 | 0.24 | 91.32% | 0.0207 | down |
9 | O15020 | 0.32 | 91.48% | 0.0224 | down |
10 | Q70CQ2 | 0.36 | 95.00% | 0.0040 | down |
11 | Q96M86 | 0.47 | 93.95% | 0.0109 | down |
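
As a reading aid for the candidate table above, the following sketch encodes its rows and derives the direction of regulation from the fold change. It assumes that FC is a tumor-to-normal expression ratio, so that FC > 1 means up-regulated and FC < 1 means down-regulated; this convention is inferred from the Type column rather than stated in this section. The names `CANDIDATES` and `direction` are hypothetical. No selection thresholds are assumed; the code only reports what holds for the listed rows.

```python
import math

# (accession, FC, predicted CSF probability, q-value, type) from the table above.
CANDIDATES = [
    ("Q9H4X1", 3.19, 0.9266, 0.0218, "up"),
    ("P28370", 3.18, 0.9215, 0.0007, "up"),
    ("P49368", 2.38, 0.9318, 0.0073, "up"),
    ("O14497", 2.29, 0.9060, 0.0094, "up"),
    ("Q9P2E5", 2.21, 0.9093, 0.0065, "up"),
    ("Q9UPP2", 0.14, 0.9021, 0.0105, "down"),
    ("Q12955", 0.14, 0.9678, 0.0053, "down"),
    ("Q14643", 0.24, 0.9132, 0.0207, "down"),
    ("O15020", 0.32, 0.9148, 0.0224, "down"),
    ("Q70CQ2", 0.36, 0.9500, 0.0040, "down"),
    ("Q96M86", 0.47, 0.9395, 0.0109, "down"),
]

def direction(fc):
    """Map a fold-change ratio to a regulation label (assumed convention)."""
    if fc > 1:
        return "up"
    if fc < 1:
        return "down"
    return "unchanged"

# log2(FC) puts up- and down-regulation on a symmetric scale around zero.
LOG2_FC = {acc: math.log2(fc) for acc, fc, _, _, _ in CANDIDATES}

# Summary statistics of the listed rows (observations, not thresholds).
MAX_Q = max(q for _, _, _, q, _ in CANDIDATES)
MIN_PROB = min(p for _, _, p, _, _ in CANDIDATES)

if __name__ == "__main__":
    for acc, fc, prob, q, label in CANDIDATES:
        print(acc, f"log2FC={LOG2_FC[acc]:+.2f}", direction(fc), label)
    print("largest q-value:", MAX_Q, "smallest probability:", MIN_PROB)
```

On the log2 scale, the strongest down-regulated entries (FC = 0.14) are further from zero than the strongest up-regulated entry (FC = 3.19), which is easy to miss when reading raw ratios.
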
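Type | Feature Name | Length |
---|---|---|
General sequence properties | Sequence length | 1 |
General sequence properties | Mass | 1 |
General sequence properties | Amino acid composition | 20 |
General sequence properties | Dipeptide composition | 400 |
General sequence properties | Normalized Moreau–Broto autocorrelation descriptors | 90 |
General sequence properties | Moran autocorrelation descriptors | 90 |
General sequence properties | Geary autocorrelation descriptors | 90 |
General sequence properties | Quasi-sequence-order descriptors | 160 |
General sequence properties | Pseudo-amino acid composition | 150 |
General sequence properties | Amphiphilic pseudo-amino acid composition | 80 |
General sequence properties | Total amino acid property | 3 |
Physicochemical properties | Hydrophobicity | 21 |
Physicochemical properties | Normalized van der Waals volumes | 21 |
Physicochemical properties | Polarity | 21 |
Physicochemical properties | Polarizability | 21 |
Physicochemical properties | Charge | 21 |
Physicochemical properties | Solvent accessibility | 21 |
Physicochemical properties | Surface tension | 21 |
Physicochemical properties | Molecular weight | 21 |
Physicochemical properties | Solubility in water | 21 |
Physicochemical properties | No. of hydrogen bond donors in side chain | 21 |
Physicochemical properties | No. of hydrogen bond acceptors in side chain | 21 |
Physicochemical properties | CLogP | 21 |
Physicochemical properties | Amino acid flexibility index | 21 |
Physicochemical properties | Protein–protein interface hotspot propensity (Bogan) | 21 |
Physicochemical properties | Protein–protein interface (PPI) propensity (Ma) | 21 |
Physicochemical properties | Protein–DNA interface propensity (Schneider) | 21 |
Physicochemical properties | Protein–DNA interface propensity (Ahmad) | 21 |
Physicochemical properties | Protein–RNA interface propensity (Kim) | 21 |
Physicochemical properties | Protein–RNA interface propensity (Ellis) | 21 |
Physicochemical properties | Protein–RNA interface propensity (Phipps) | 21 |
Physicochemical properties | Protein–ligand binding site propensity (Khazanov) | 21 |
Physicochemical properties | Protein–ligand valid binding site propensity (Khazanov) | 21 |
Physicochemical properties | Propensity for protein–ligand polar and atom–Imai | 21 |
Physicochemical properties | Isoelectric point | 1 |
Domains/motifs properties | Twin-arginine signal peptide | 1 |
Domains/motifs properties | Transmembrane domains | 1 |
Domains/motifs properties | Signal peptide | 1 |
Domains/motifs properties | Number of glycosylation sites | 1 |
Domains/motifs properties | Glycosylation presence | 1 |
Domains/motifs properties | Phosphorylation sites | 1 |
Domains/motifs properties | Cleavage site | 3 |
Domains/motifs properties | Subcellular location | 3 |
Domains/motifs properties | Percentage of coil content | 1 |
Domains/motifs properties | Number of predicted motif sites | 1 |
Domains/motifs properties | Transmembrane helices | 1 |
Structural properties | Secondary structure | 21 |
Structural properties | Unfoldability | 1 |
Structural properties | Fldbin charge | 1 |
Structural properties | Number of disordered regions | 1 |
Structural properties | Longest disordered regions | 1 |
Structural properties | Number of disordered residues | 1 |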
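
To show where the lengths 20 and 400 in the feature table come from, the sketch below computes amino acid composition and dipeptide composition directly from a sequence. It uses the common definitions (residue fractions, and fractions of overlapping adjacent residue pairs) and is not the authors' feature pipeline; the tool actually used may scale values differently, for example as percentages. The function names and the fixed alphabet ordering are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 20 * 20 = 400

def _validate(seq):
    """Upper-case the sequence and reject non-standard residue codes."""
    seq = seq.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    return seq

def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = _validate(seq)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each overlapping residue pair in the sequence (400 values)."""
    seq = _validate(seq)
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total < 1:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]

if __name__ == "__main__":
    example = "MKTAYIAKQR"  # arbitrary toy sequence, not from the dataset
    features = amino_acid_composition(example) + dipeptide_composition(example)
    print(len(features))  # 420 = 20 + 400
```

Together these two descriptor families account for 420 of the columns listed in the table. Because a typical protein contains far fewer distinct dipeptides than 400, the dipeptide block is sparse.
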
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
He, K.; Wang, Y.; Xie, X.; Shao, D. Prediction of Proteins in Cerebrospinal Fluid and Application to Glioma Biomarker Identification. Molecules 2023, 28, 3617. https://doi.org/10.3390/molecules28083617