Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. Feature Extraction
- (i) Sequence-based features: The amino acid composition (AAC) of the entire amino acid sequence and Chou’s pseudo amino acid composition (PseAAC) were generated, the latter in both parallel and series correlation modes. Chou’s PseAAC [37] is widely used to convert protein sequences of varying lengths into fixed-length numerical feature vectors. Compared with AAC, PseAAC is more informative because it incorporates sequence-order information in addition to composition. Hence, it is widely applied in various amino acid sequence-based prediction problems [38]. PseAAC was calculated using the Pse-in-One program [39] with parameter lambda = 2, 10 and weight = 0.05, 0.1.
- (ii) Composition–transition–distribution (CTD): Three types of descriptors based on the grouped amino acid composition [40,41], namely the composition (CTDC), transition (CTDT), and distribution (CTDD) descriptors, were calculated using the protr R package [42,43]. All amino acid residues were divided into 3 groups (neutral, hydrophobic, and polar) according to 7 types of physicochemical properties, as defined in [41]. The 7 physicochemical properties used for calculating these features were hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility.
- (iii) Various physicochemical property-based features: Quasi-sequence-order descriptors (QSO) [44], Cruciani properties [45], zScales [46], FASGAI vectors (factor analysis scales of generalized amino acid information) [47], tScales [48], VHSE-scales (principal components score vectors of hydrophobic, steric, and electronic properties) [49], protFP [50], stScale [51], MS-WHIM score [52], the aliphatic index [53], the autocovariance index [53], the Boman (potential protein interaction) index [54], the net charge, the cross-covariance index [45], the instability index [55], the hydrophobic moment, and the isoelectric point (pI) were calculated using the Peptides R package [56] with parameter nlag = 10 and weight = 0.1.
- (iv) Signal peptide-based features: In addition to the sequence features mentioned above, functional or signal peptide regions were used in this prediction. Signal peptides are associated with the transfer of a protein to, or its function at, its localization site [57]. For example, a protein containing a signal peptide is likely to be transferred to the secretory pathway, while a protein containing a nuclear localization signal (NLS) is likely to be localized in the nucleus, so NLSs were used as important features for detecting nuclear proteins. In this work, to identify the signal sequences for the secretory pathway (signal peptides) and to predict the positions of signal peptide cleavage sites and transmembrane segments, the prediction scores obtained from well-known prediction programs, such as TargetP [58], SignalP [59], Phobius [60], and TMHMM [61], were used as feature scores (features: SP, cTP, mTP, other, and TM). NLSs were predicted using the hidden Markov model (HMM) of NLStradamus [62] (feature: NLS). However, this type of feature has limitations: signal peptides are not yet completely understood, and the set of currently known signals might be incomplete.
- (v) Integration of other methods: We used the ERPred [63] score and the SubMitoPred [64] SVM scores as features for discriminating ER and mitochondrial proteins, respectively. These programs were not used directly to predict locations. Instead, their outputs served as complementary numerical features from which the model learns when making decisions.
- (vi) Secondary structure conformation features: The aggregation, amyloid, turn, alpha-helix, helical aggregation, and beta-strand conformation scores were calculated using the Tango program [65].
- (vii) Homology and Gene Ontology (GO) annotation-based features: BLAST [66] was used to search for homologous sequences. Evolutionarily closely related proteins have a high probability of sharing similar subcellular localizations, so this type of feature can outperform other features when a homologous protein with a localization annotation is available [67]. Its limitation is that the performance of sequence homology-based methods might be significantly reduced when no homologous sequence is detected for the query [68]. In addition, GO features can produce noisy and confounded predictions [69,70], because a protein can have multiple GO terms that map to different subcellular localizations, resulting in inconsistency with the true subcellular locations of proteins [12]. The GO database used in this work is a compact database that was filtered to remove redundant information (<25% sequence similarity threshold) and contained only representative sequences that did not overlap in the training and testing data. A set of GO terms in the “cellular component” category was retrieved by searching against the Gene Ontology Annotation database [71] and the UniProtKB/Swiss-Prot database [72]. The GO terms used in this work are summarized in Table 3.
2.3. Feature Selection
2.4. Model Selection
2.5. Evaluation Measurement
3. Results and Discussion
3.1. Comparison of Different Features/Feature Analysis
3.2. Discriminative and Informative Reduced Feature Subset
3.3. A 10-Fold Cross-Validation of Predictive Performance with the Training Dataset
3.4. Classification Performance for the Independent Testing Dataset
3.5. Comparison with Other Existing Tools
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Casadio, R.; Martelli, P.L.; Pierleoni, A. The prediction of protein subcellular localization from sequence: A shortcut to functional genome annotation. Brief. Funct. Genom. Proteom. 2008, 7, 63–73.
- Tung, C.; Chen, C.; Sun, H.; Chu, Y. Predicting human protein subcellular localization by heterogeneous and comprehensive approaches. PLoS ONE 2017, 12, e0178832.
- Kumar, R.; Dhanda, S.K. Bird Eye View of Protein Subcellular Localization Prediction. Life 2020, 10, 347.
- Kumar, A.; Ahmad, A.; Vyawahare, A.; Khan, R. Membrane Trafficking and Subcellular Drug Targeting Pathways. Front. Pharm. 2020, 11, 629.
- Rajendran, L.; Knölker, H.; Simons, K. Subcellular targeting strategies for drug design and delivery. Nat. Rev. Drug Discov. 2010, 9, 29–42.
- The UniProt Consortium. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2017, 45, D158–D169.
- Tung, T.; Lee, D. A method to improve protein subcellular localization prediction by integrating various biological data sources. BMC Bioinform. 2009, 10, S43.
- Yang, F.; Liu, Y.; Wang, Y.; Yin, Z.; Yang, Z. MIC_Locator: A novel image-based protein subcellular location multi-label prediction model based on multi-scale monogenic signal representation and intensity encoding strategy. BMC Bioinform. 2019, 20, 522.
- Zou, H.; Xiao, X. Predicting the Functional Types of Singleplex and Multiplex Eukaryotic Membrane Proteins via Different Models of Chou’s Pseudo Amino Acid Compositions. J. Membr. Biol. 2016, 249, 23–29.
- Blum, T.; Briesemeister, S.; Kohlbacher, O. MultiLoc2: Integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinform. 2009, 10, 274.
- Sahu, S.; Loaiza, C.; Kaundal, R. Plant-mSubP: A computational framework for the prediction of single- and multi-target protein subcellular localization using integrated machine-learning approaches. AoB Plants 2020, 12, plz068.
- Wan, S.; Mak, M.W.; Kung, S.Y. mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines. BMC Bioinform. 2012, 13, 290.
- Chi, S.M.; Nam, D. Wegoloc: Accurate prediction of protein subcellular localization using weighted gene ontology terms. Bioinformatics 2012, 28, 1028–1030.
- Goldberg, T.; Hecht, M.; Hamp, T.; Karl, T.; Yachdav, G.; Ahmed, N.; Altermann, U.; Angerer, P.; Ansorge, S.; Balasz, K.; et al. LocTree3 prediction of localization. Nucleic Acids Res. 2014, 42, W350–W355.
- Chou, K.-C.; Shen, H.-B. Plant-mPLoc: A Top-Down Strategy to Augment the Power for Predicting Plant Protein Subcellular Localization. PLoS ONE 2010, 5, e11335.
- Wu, Z.C.; Xiao, X.; Chou, K.C. iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Mol. Biosyst. 2011, 7, 3287–3297.
- Briesemeister, S.; Rahnenführer, J.; Kohlbacher, O. YLoc–an interpretable web server for predicting subcellular localization. Nucleic Acids Res. 2010, 38, W497–W502.
- King, B.R.; Vural, S.; Pandey, S.; Barteau, A.; Gudaet, C. ngLOC: Software and web server for predicting protein subcellular localization in prokaryotes and eukaryotes. BMC Res. Notes 2012, 5, 351.
- Adelfio, A.; Volpato, V.; Pollastri, G. SCLpredT: Ab initio and homology-based prediction of subcellular localization by N-to-1 neural networks. SpringerPlus 2013, 2, 1–11.
- Wei, L.; Ding, Y.; Su, R.; Tang, J.; Zou, Q. Prediction of human protein subcellular localization using deep learning. J. Parallel Distrib. Comput. 2018, 117, 212–217.
- Cheng, X.; Xiao, X.; Chou, K. pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 2018, 110, 50–58.
- Wan, S.; Mak, M.W.; Kung, S.Y. HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins. PLoS ONE 2014, 9, e89545.
- Savojardo, C.; Martelli, P.L.; Fariselli, P.; Profiti, G.; Casadio, R. BUSCA: An integrative web server to predict subcellular localization of proteins. Nucleic Acids Res. 2018, 46, W459–W466.
- Sperschneider, J.; Catanzariti, A.; DeBoer, K.; Petre, B.; Gardiner, D.; Singh, K.; Dodds, P.; Taylor, J. LOCALIZER: Subcellular localization prediction of both plant and effector proteins in the plant cell. Sci. Rep. 2017, 7, 44598.
- Zhang, S.; Duan, X. Prediction of protein subcellular localization with oversampling approach and Chou’s general PseAAC. J. Theor. Biol. 2018, 437, 239–250.
- Yao, Y.; Lv, Y.; Li, L.; Xu, H.; Ji, B.; Chen, J.; Li, C.; Liao, B.; Nan, X. Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method. BMC Bioinform. 2019, 20, 719.
- Li, B.; Cai, L.; Liao, B.; Fu, X.; Bing, P.; Yang, J. Prediction of Protein Subcellular Localization Based on Fusion of Multiview Features. Molecules 2019, 24, 919.
- Chou, K.; Shen, H. A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0. PLoS ONE 2010, 5, e9931.
- Nuannimnoi, S.; Lertampaiporn, S.; Thammarongtham, C. Improved prediction of eukaryotic protein subcellular localization using particle swarm optimization of multiple classifiers. In Proceedings of the IEEE 21st International Computer Science and Engineering Conference (ICSEC), Bangkok, Thailand, 15–18 November 2017; pp. 1–5.
- Lertampaiporn, S.; Nuannimnoi, S.; Vorapreeda, T.; Chokesajjawatee, N.; Visessanguan, W.; Thammarongtham, C. PSO-LocBact: A Consensus Method for Optimizing Multiple Classifier Results for Predicting the Subcellular Localization of Bacterial Proteins. Biomed. Res. Int. 2019, 2019, 5617153.
- Shen, H.; Chou, K. Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem. Biophys. Res. Commun. 2007, 355, 1006–1011.
- Du, L.; Meng, Q.; Chen, Y.; Wu, P. Subcellular location prediction of apoptosis proteins using two novel feature extraction methods based on evolutionary information and LDA. BMC Bioinform. 2020, 21, 212.
- Wolpert, D.; Macready, W. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
- Kuncheva, L. Combining Pattern Classifiers: Methods and Algorithms, 2nd ed.; Wiley: Hoboken, NJ, USA, 2014; Volume 8, pp. 263–272.
- Polikar, R. Ensemble Based Systems in Decision Making. IEEE Circuits Syst. Mag. 2006, 6, 21–45.
- Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
- Chou, K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21, 10–19.
- Chou, K.C. Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review). J. Theor. Biol. 2011, 273, 236–247.
- Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L.; Chou, K. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43, W65–W71.
- Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.H. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA 1995, 92, 8700–8704.
- Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S. Recognition of a protein fold in the context of the scop classification. Proteins Struct. Funct. Genet. 1999, 35, 401–407.
- Xiao, N.; Cao, D.S.; Zhu, M.F.; Xu, Q.S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 2015, 31, 1857–1859.
- R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012; ISBN 3-900051-07-0.
- Chou, K.C. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 2000, 278, 477–483.
- Cruciani, G.; Baroni, M.; Carosati, E.; Clementi, M.; Valigi, R.; Clementi, S. Peptide studies by means of principal properties of amino acids derived from MIF descriptors. J. Chemom. 2004, 18, 146–155.
- Sandberg, M.; Eriksson, L.; Jonsson, J.; Sjostrom, M.; Wold, S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 1998, 41, 2481–2491.
- Liang, G.; Li, Z. Factor analysis scale of generalized amino acid information as the source of a new set of descriptors for elucidating the structure and activity relationships of cationic antimicrobial peptides. Mol. Inform. 2007, 26, 754–763.
- Tian, F.; Zhou, P.; Li, Z. T-scale as a novel vector of topological descriptors for amino acids and its application in QSARs of peptides. J. Mol. Struct. 2007, 830, 106–115.
- Mei, H.U.; Liao, Z.H.; Zhou, Y.; Li, S.Z. A new set of amino acid descriptors and its application in peptide QSARs. Pept. Sci. 2005, 80, 775–786.
- van Westen, G.J.; Swier, R.F.; Wegner, J.K.; IJzerman, A.P.; van Vlijmen, H.W.; Bender, A. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): Comparative study of 13 amino acid descriptor sets. J. Cheminform. 2013, 5, 41.
- Yang, L.; Shu, M.; Ma, K.; Mei, H.; Jiang, Y.; Li, Z. ST-scale as a novel amino acid descriptor and its application in QSAM of peptides and analogues. Amino Acids 2010, 38, 805–816.
- Zaliani, A.; Gancia, E. MS-WHIM scores for amino acids: A new 3D-description for peptide QSAR and QSPR studies. J. Chem. Inf. Comput. Sci. 1999, 39, 525–533.
- Ikai, A. Thermostability and aliphatic index of globular proteins. J. Biochem. 1980, 88, 1895–1898.
- Boman, H.G. Antibacterial peptides: Basic facts and emerging concepts. J. Intern. Med. 2003, 254, 197–215.
- Guruprasad, K.; Reddy, B.V.; Pandit, M.W. Correlation between stability of a protein and its dipeptide composition: A novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990, 4, 155–161.
- Osorio, D.; Rondon-Villarreal, P.; Torres, R. Peptides: A package for data mining of antimicrobial peptides. R J. 2015, 7, 4–14.
- Imai, K.; Nakai, K. Tools for the Recognition of Sorting Signals and the Prediction of Subcellular Localization of Proteins from Their Amino Acid Sequences. Front. Genet. 2020, 11, 1491.
- Emanuelsson, O.; Nielsen, H.; Brunak, S.; Heijne, G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 2000, 300, 1005–1016.
- Almagro Armenteros, J.J.; Tsirigos, K.D.; Sønderby, C.K.; Petersen, T.N.; Winther, O.; Brunak, S.; Heijne, G.; Nielsen, H. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 2019, 37, 420–423.
- Käll, L.; Krogh, A.; Sonnhammer, E.L. A Combined Transmembrane Topology and Signal Peptide Prediction Method. J. Mol. Biol. 2004, 338, 1027–1036.
- Krogh, A.; Larsson, B.; Heijne, G.; Sonnhammer, E.L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305, 567–580.
- Nguyen Ba, A.N.; Pogoutse, A.; Provart, N.; Moses, A.M. NLStradamus: A simple Hidden Markov Model for nuclear localization signal prediction. BMC Bioinform. 2009, 10, 202.
- Kumar, R.; Kumari, B.; Kumar, M. Prediction of endoplasmic reticulum resident proteins using fragmented amino acid composition and support vector machine. PeerJ 2017, 5, e3561.
- Kumar, R.; Kumari, B.; Kumar, M. Proteome-wide prediction and annotation of mitochondrial and sub-mitochondrial proteins by incorporating domain information. Mitochondrion 2018, 42, 11–22.
- Fernandez-Escamilla, A.M.; Rousseau, F.; Schymkowitz, J.; Serrano, L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat. Biotech. 2004, 22, 1302–1306.
- Schaffer, A.A.; Aravind, L.; Madden, T.L.; Shavirin, S.; Spouge, J.L.; Wolf, Y.I.; Koonin, E.V.; Altschul, S.F. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001, 29, 2994–3005.
- Imai, K.; Nakai, K. Prediction of subcellular locations of proteins: Where to proceed? Proteomics 2010, 10, 3970–3983.
- Su, E.; Chang, J.; Cheng, C.; Sung, T.; Hsu, W. Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing. BMC Bioinform. 2012, 13, S13.
- Gillis, J.; Pavlidis, P. Assessing identity, redundancy and confounds in gene ontology annotations over time. Bioinformatics 2013, 29, 476–482.
- Yu, G.; Lu, C.; Wang, J. NoGOA: Predicting noisy GO annotations using evidences and sparse representation. BMC Bioinform. 2017, 18, 350.
- Barrell, D.; Dimmer, E.; Huntley, R.P.; Binns, D.; O’Donovan, C.; Apweiler, R. The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 2009, 37, D396–D403.
- Camon, E.; Magrane, M.; Barrell, D.; Lee, V.; Dimmer, E.; Maslen, J.; Binns, D.; Harte, N.; Lopez, R.; Apweiler, R. The Gene Ontology Annotation (GOA) Database: Sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32, D262–D266.
- Kira, K.; Rendell, L.A. A practical approach to feature selection. In Proceedings of the Ninth International Workshop on Machine Learning, Aberdeen, Scotland, 1–3 July 1992; pp. 249–256.
- Holte, R.C. Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 1993, 11, 63–91.
- Hall, M.A.; Holmes, G. Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 2003, 15, 1437–1447.
- Gou, J. A Novel Weighted Voting for K-Nearest Neighbor Rule. J. Comput. 2011, 6, 833–840.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.

Type | Subcellular Location | Training Data (Original) | Training Data (25% CD-HIT) | Testing Data |
---|---|---|---|---|
Single location | Plastid | 2468 | 533 | 248 |
Single location | Cytoplasm | 351 | 351 | 40 |
Single location | Extracellular | 140 | 140 | 14 |
Single location | Nucleus | 568 | 568 | 63 |
Single location | Mitochondrion | 447 | 447 | 52 |
Single location | Cell membrane | 829 | 438 | 92 |
Single location | Golgi apparatus | 204 | 204 | 23 |
Single location | Endoplasmic reticulum | 280 | 280 | 29 |
Single location | Vacuole | 176 | 176 | 20 |
Single location | Peroxisome | 57 | 57 | 6 |
Single location | Cell wall | 37 | 37 | 5 |
Multilocation | Mito-Plastid | 118 | 118 | 13 |
Multilocation | Cyto-Nucleus | 170 | 170 | 20 |
Multilocation | Cyto-Golgi | 34 | 34 | 4 |
Total | | 5879 | 3553 | 629 |

Features (Total = 479 Features) | Abbreviation |
---|---|
Amino acid Composition | AAC1-AAC20 |
Amphiphilic PseAAC | APAAC1-APAAC30 |
BLOSUM matrix-derived | Blosum1-Blosum8 |
Composition descriptor of the CTD | CTDC1-CTDC21 |
Distribution descriptor of the CTD | CTDD1-CTDD105 |
Transition descriptor of the CTD | CTDT1-CTDT21 |
Geary autocorrelation | Geary1-Geary40 |
Pseudo amino acid composition | PAAC1-PAAC30 |
Parallel pseudo amino acid composition | PsePC1-PsePC22 |
Serial pseudo amino acid composition | PseSC1-PseSC26 |
Net charge | Charge |
Potential protein interaction index | Boman |
Aliphatic index of protein | aIndex |
Autocovariance index | autocov |
Crosscovariance1 | Crosscov1 |
Crosscovariance2 | Crosscov2 |
Cruciani covariance index | Crucian1-Crucian3 |
Factor analysis scales of generalized amino acid information | fasgai1-fasgai6 |
Hmoment alpha helix | Hmoment1 |
Hmoment beta sheet | Hmoment2 |
Hydrophobicity index | hydrophobicity |
Instability index | Instaindex |
MS-WHIM scores derived from 36 electrostatic potential properties | mswhimscore1-mswhimscore3 |
Isoelectric point (pI) | pI |
Average of protFP | protFP1-protFP8 |
ST-scale based on physicochemical properties | stscales1-stscales8 |
T-scale based on physicochemical properties | tscales1-tscales5 |
VHSE-scale based on physicochemical properties | vhsescales1-vhsescales8 |
Z-scale based on physicochemical properties | zscales1-zscales5 |
Quasi-sequence-order descriptor | QSO1-QSO60 |
Sequence-order-coupling numbers | SOCN1-SOCN20 |
Chloroplast transit peptide | cTP |
Mitochondrial transit peptide | mTP |
Signal peptide cleavage site score | SP |
Number of predicted transmembrane segments | TM |
Other location score from TargetP | other |
Nuclear localization signal | NLS |
SVM score from ERPred | erpred |
SubmitoPred (SVM_score_mito) | SVM_mito |
SubmitoPred (SVM_inner_mem) | SVM_mem |
SubmitoPred (SVM_inter_mem) | SVM_inter |
SubmitoPred (SVM_score_matrix) | SVM_matrix |
SubmitoPred (SVM_score_outer_mem) | SVM_outer |
Aggregation (tango1) | Tango1 |
Amyloid (tango2) | Tango2 |
Turn-turns (tango3) | Tango3 |
Alpha-helices (tango4) | Tango4 |
Helical aggregation (tango5) | Tango5 |
Beta-strands (tango6) | Tango6 |
Homology based feature (GO term) | Homology |

GO Term | Cellular Component |
---|---|
GO:0005737 | cytoplasm |
GO:0005783 | endoplasmic reticulum |
GO:0005788 | endoplasmic reticulum lumen |
GO:0005789 | endoplasmic reticulum membrane |
GO:0005793 | endoplasmic reticulum-Golgi intermediate compartment |
GO:0005615 | extracellular space |
GO:0005794 | Golgi apparatus |
GO:0005796 | Golgi lumen |
GO:0000139 | Golgi membrane |
GO:0005739 | mitochondrion |
GO:0005740 | mitochondrial envelope |
GO:0005743 | mitochondrial inner membrane |
GO:0005758 | mitochondrial intermembrane space |
GO:0005759 | mitochondrial matrix |
GO:0031966 | mitochondrial membrane |
GO:0005741 | mitochondrial outer membrane |
GO:0005886 | plasma membrane |
GO:0005618 | cell wall |
GO:0005634 | nucleus |
GO:0009536 | plastid |
GO:0009528 | plastid inner membrane |
GO:0005777 | peroxisome |
GO:0005778 | peroxisomal membrane |
GO:0005773 | vacuole |
GO:0005774 | vacuolar membrane |
GO:0016020 | membrane |
GO:0009507 | chloroplast |

All Features (479) | RF | KNN | XGB | HeteroEnsemble |
---|---|---|---|---|
ACC | 82.72% | 82.62% | 85.97% | 91.00% |
MCC | 0.795 | 0.798 | 0.845 | 0.896 |
AUC | 0.977 | 0.897 | 0.975 | 0.993 |

OneR (87) | RF | KNN | XGB | HeteroEnsemble1 |
---|---|---|---|---|
ACC | 92.02% | 91.51% | 93.87% | 93.76% |
MCC | 0.907 | 0.902 | 0.932 | 0.929 |
AUC | 0.995 | 0.991 | 0.993 | 0.996 |

ReliefF (95) | RF | KNN | XGB | HeteroEnsemble2 |
---|---|---|---|---|
ACC | 94.27% | 89.57% | 93.46% | 93.97% |
MCC | 0.935 | 0.879 | 0.928 | 0.932 |
AUC | 0.996 | 0.991 | 0.992 | 0.996 |

CFS + Genetics (109) | RF | KNN | XGB | HeteroEnsemble3 |
---|---|---|---|---|
ACC | 94.48% | 93.05% | 95.30% | 94.68% |
MCC | 0.938 | 0.921 | 0.948 | 0.940 |
AUC | 0.996 | 0.994 | 0.996 | 0.997 |

Type | Subcellular Location | Testing Data | Correctly Predicted | Percent | MCC |
---|---|---|---|---|---|
Single location | Plastid | 248 | 238 | 95.97% | 0.756 |
Single location | Cytoplasm | 40 | 34 | 85.00% | 0.829 |
Single location | Extracellular | 14 | 9 | 64.28% | 0.756 |
Single location | Nucleus | 63 | 61 | 96.82% | 0.854 |
Single location | Mitochondrion | 52 | 31 | 59.61% | 0.708 |
Single location | Cell membrane | 92 | 81 | 88.04% | 0.792 |
Single location | Golgi apparatus | 23 | 14 | 60.86% | 0.747 |
Single location | Endoplasmic reticulum | 29 | 25 | 86.21% | 0.710 |
Single location | Vacuole | 20 | 5 | 25.00% | 0.359 |
Single location | Peroxisome | 6 | 3 | 50.00% | 0.705 |
Single location | Cell wall | 5 | 5 | 100.00% | 1.000 |
Single location | Total (single location) | 592 | 506 | 85.47% | 0.747 |
Multilocation | Mito-Plastid | 13 | 8 | 61.54% | 0.607 |
Multilocation | Cyto-Nucleus | 20 | 18 | 90.00% | 0.897 |
Multilocation | Cyto-Golgi | 4 | 0 | 0.00% | 0.000 |
Multilocation | Total (multilocation) | 37 | 26 | 70.27% | 0.501 |
Total All | | 629 | 532 | 84.58% | 0.694 |
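
The Percent column in the table above is the share of test proteins predicted correctly, and MCC is the Matthews correlation coefficient [77]. The following Python sketch reproduces the reported totals and shows the standard binary (one-vs-rest) MCC formula. How the per-class MCC was aggregated in the multilabel setting is not stated here, so the function names and the confusion counts in the cell wall example are illustrative assumptions: the table gives no false-positive counts, and zero false positives is assumed because it is consistent with the reported MCC of 1.

```python
import math


def percent_correct(correct, total):
    """Share of test proteins predicted correctly, in percent."""
    return 100.0 * correct / total


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one class (one-vs-rest).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Totals taken from the table: (correctly predicted, testing proteins).
totals = {
    "single location": (506, 592),
    "multilocation": (26, 37),
    "all": (532, 629),
}
for name, (correct, total) in totals.items():
    print(f"{name}: {percent_correct(correct, total):.2f}%")

# Assumed counts: all 5 cell wall proteins found, no false positives,
# and the remaining 624 of 629 test proteins as true negatives.
cell_wall_mcc = mcc(tp=5, tn=624, fp=0, fn=0)
print(cell_wall_mcc)
```

Under these assumptions the three totals print as 85.47%, 70.27%, and 84.58%, matching the table, and the cell wall example gives an MCC of 1.0.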

Method | Machine Learning Technique | Accuracy, % (Single + Dual Label Data) | Accuracy, % (Dual Label Data) |
---|---|---|---|
YLoc [17] | Naïve Bayes | 34.35 | 35.89 |
Euk-mPloc 2.0 [28] | OET-KNN 1 | 53.5 | 44.86 |
iLoc-Plant [16] | ML-KNN 2 | 37.42 | 34.42 |
Plant-mSubP [11] | SVM 3 | 64.84 | 81.08 |
Our model | Ensemble | 84.58 | 70.27 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wattanapornprom, W.; Thammarongtham, C.; Hongsthong, A.; Lertampaiporn, S. Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization. Life 2021, 11, 293. https://doi.org/10.3390/life11040293