Protein Subnuclear Localization Based on Radius-SMOTE and Kernel Linear Discriminant Analysis Combined with Random Forest
Abstract
1. Introduction
2. Methods
2.1. Position-Specific Score Matrix
2.2. Kernel Linear Discriminant Analysis
- Map the input samples to a higher-dimensional feature space with a nonlinear mapping function;
- Calculate the mean of all mapped samples and the mean of the mapped samples of each class;
- Calculate the intraclass covariance matrix and the interclass covariance matrix of the mapped samples;
- Find the optimal projection direction by minimizing the intraclass distance and maximizing the interclass distance, a process expressed as Equation (8). The projection direction is a linear combination of the mapped samples. In Equation (8), the mapping function is unknown and the feature space F may not be unique, so the criterion cannot be computed directly. The kernel trick is therefore introduced, and Equations (4) and (5) can be rewritten as Equations (10) and (11). Combining Equations (6)–(11) gives the final criterion function for dimension reduction;
- Obtain the final rank-reduced projection matrix.
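The formulas behind these steps did not survive in this copy of the text. As a reading aid, the block below sketches the textbook kernel Fisher discriminant formulation that the steps describe. The symbols ($\phi$, $n$, $n_c$, $\boldsymbol{\alpha}$, $\mathbf{M}$, $\mathbf{N}$, $k$) are assumptions chosen for this sketch, and the scaling and numbering may differ from the paper's own Equations.

```latex
% Mapped samples and class means (n samples, C classes, n_c samples in class c)
\mathbf{m}^{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}_i), \qquad
\mathbf{m}_c^{\phi} = \frac{1}{n_c}\sum_{\mathbf{x}\in\text{class } c}\phi(\mathbf{x})

% Intraclass and interclass scatter in the feature space
\mathbf{S}_w^{\phi} = \sum_{c=1}^{C}\sum_{\mathbf{x}\in\text{class } c}
  \bigl(\phi(\mathbf{x})-\mathbf{m}_c^{\phi}\bigr)\bigl(\phi(\mathbf{x})-\mathbf{m}_c^{\phi}\bigr)^{T}, \qquad
\mathbf{S}_b^{\phi} = \sum_{c=1}^{C} n_c
  \bigl(\mathbf{m}_c^{\phi}-\mathbf{m}^{\phi}\bigr)\bigl(\mathbf{m}_c^{\phi}-\mathbf{m}^{\phi}\bigr)^{T}

% Fisher criterion and expansion of the projection direction
J(\mathbf{w}) = \frac{\mathbf{w}^{T}\mathbf{S}_b^{\phi}\mathbf{w}}{\mathbf{w}^{T}\mathbf{S}_w^{\phi}\mathbf{w}}, \qquad
\mathbf{w} = \sum_{i=1}^{n}\alpha_i\,\phi(\mathbf{x}_i)

% Kernel trick: k(x_i, x_j) = phi(x_i)^T phi(x_j)
(\mathbf{M}_c)_i = \frac{1}{n_c}\sum_{\mathbf{x}\in\text{class } c} k(\mathbf{x}_i,\mathbf{x}), \qquad
(\mathbf{M}_0)_i = \frac{1}{n}\sum_{j=1}^{n} k(\mathbf{x}_i,\mathbf{x}_j)

\mathbf{M} = \sum_{c=1}^{C} n_c(\mathbf{M}_c-\mathbf{M}_0)(\mathbf{M}_c-\mathbf{M}_0)^{T}, \qquad
\mathbf{N} = \sum_{c=1}^{C}\mathbf{K}_c\bigl(\mathbf{I}-\mathbf{1}_{n_c}\bigr)\mathbf{K}_c^{T}

% K_c is the n x n_c kernel matrix between all samples and class c;
% 1_{n_c} is the n_c x n_c matrix with every entry equal to 1/n_c.
J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{T}\mathbf{M}\boldsymbol{\alpha}}
                              {\boldsymbol{\alpha}^{T}\mathbf{N}\boldsymbol{\alpha}}
```

In this standard form the projection directions are the leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$ (in practice $\mathbf{N}$ is often regularized before inversion), and a new sample $\mathbf{x}$ is projected as $\sum_i \alpha_i\,k(\mathbf{x}_i,\mathbf{x})$.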
2.3. The Proposed Radius-SMOTE
- Calculate the imbalance rate;
- Select a minority-class sample, find its k nearest neighbors within the minority class, and select one of these neighbors at random;
- Calculate the Euclidean distance between the selected sample and the chosen neighbor;
- Randomly draw values to generate a vector whose length equals the feature dimension of the sample. Taking the selected sample as the center and this distance as the radius defines a circle, as shown in Figure 3a; a new sample can then be inserted within this circle by Equation (13);
- Repeat Steps 2 to 4 as many times as the imbalance rate requires. The samples synthesized by Radius-SMOTE are shown in Figure 3b.
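The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the imbalance-rate formula and Equation (13) are not available in this copy, so the sketch assumes the imbalance rate is the majority size divided by the minority size, that the circle is centered on the selected sample, and that a new point is placed at a random direction and a random fraction of the radius. The function names `imbalance_rate` and `radius_smote` are hypothetical.

```python
import math
import random


def imbalance_rate(n_majority, n_minority):
    # Assumed definition: majority class size divided by minority class size.
    return n_majority / n_minority


def radius_smote(minority, n_new, k=5, seed=0):
    """Return n_new tuples (synthetic_point, center, radius)."""
    rng = random.Random(seed)
    dim = len(minority[0])
    out = []
    for _ in range(n_new):
        # Step 2: pick a minority sample and one of its k nearest minority neighbors.
        center = rng.choice(minority)
        others = [p for p in minority if p is not center]
        neighbors = sorted(others, key=lambda p: math.dist(center, p))[:k]
        neighbor = rng.choice(neighbors)
        # Step 3: Euclidean distance between the sample and the chosen neighbor.
        radius = math.dist(center, neighbor)
        # Step 4 (assumed form): random direction, rescaled so that the new
        # point lies strictly inside the ball of the given radius.
        direction = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction)) or 1.0
        scale = rng.random() * radius / norm
        point = [c + scale * v for c, v in zip(center, direction)]
        out.append((point, center, radius))
    return out
```

To balance a class under these assumptions, one could call `radius_smote(minority, n_new)` with `n_new = round((imbalance_rate(n_majority, n_minority) - 1) * n_minority)`. The center and radius are returned alongside each point only so that the placement can be checked.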
2.4. Random Forest
- Randomly select the training subsets from the original dataset;
- Build a decision tree for each training subset; the trees do not need to be pruned;
- Construct the RF model from the resulting forest, which is composed of tens or hundreds of decision trees;
- To classify a new sample, let each decision tree in the forest give an individual result;
- Count the votes for each class and take the class with the most votes as the final result.
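The five steps above can be illustrated with a small, dependency-free sketch. It is not the configuration used in the experiments: the Gini split criterion, the midpoint thresholds, the random choice of about sqrt(d) features at each split, and the names `build_tree`, `fit_forest`, and `predict_forest` are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, n_feats, rng):
    # Grow without pruning: stop only when the node is pure or cannot be split.
    if len(set(y)) == 1:
        return y[0]
    best = None
    for f in rng.sample(range(len(X[0])), n_feats):
        vals = sorted({row[f] for row in X})
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2  # midpoint threshold between adjacent values
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split on the sampled features
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], n_feats, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], n_feats, rng))


def predict_tree(node, x):
    # Internal nodes are tuples (feature, threshold, left, right); leaves are labels.
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_feats = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        # Steps 1-2: bootstrap a training subset and grow one tree on it.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], n_feats, rng))
    return forest  # Step 3: the forest is the list of trees


def predict_forest(forest, x):
    # Steps 4-5: every tree votes; the class with the most votes wins.
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Class labels are assumed to be non-tuple values (for example integers or strings), because tuples are reserved for internal nodes in this sketch.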
3. Experiments
3.1. Datasets
3.2. Evaluation Indexes
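The result tables report four per-class indexes: Se, Sp, ACC, and MCC. The defining formulas are not present in this copy, so the sketch below uses the standard one-vs-rest definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient. The count names `tp`, `fn`, `tn`, `fp` and the function name `binary_indexes` are assumptions for illustration.

```python
import math


def binary_indexes(tp, fn, tn, fp):
    """Return (Se, Sp, ACC, MCC) for one class treated as the positive class."""
    se = tp / (tp + fn)                      # sensitivity: recall on the class
    sp = tn / (tn + fp)                      # specificity: recall on all other classes
    acc = (tp + tn) / (tp + fn + tn + fp)    # fraction of correct decisions
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc
```

A call such as `binary_indexes(8, 2, 85, 5)` uses made-up counts, not values from the paper. On a rare class, ACC and Sp can stay high while Se and MCC are low, which is the pattern visible for the small classes in the "Original Data" columns.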
3.3. The Analysis of Unbalanced Datasets
3.4. The Overall Accuracy Analysis of the Proposed Method
3.4.1. The Relationship between k in Radius-SMOTE and Overall Accuracy
3.4.2. The Relationship between k in RF and Overall Accuracy
3.5. The Analysis for Evaluation Indexes of Different Methods
3.6. Comparisons with Other Methods
3.6.1. Comparison of Dataset 1
3.6.2. Comparison of Dataset 2
4. Conclusions
- The imbalance of protein datasets has a great impact on the prediction accuracy of protein subnuclear localization;
- The proposed method can efficiently improve the prediction accuracy of protein subnuclear localization by solving the imbalance problem of protein datasets;
- The combination of KLDA and RF can improve the classification accuracy of proteins at the subnuclear level.
Author Contributions
Funding
Conflicts of Interest
References
1. Garapati, H.S.; Male, G.; Mishra, K. Predicting subcellular localization of proteins using protein-protein interaction data. Genomics 2020, 112, 2361–2368.
2. Javed, F.; Hayat, M. Predicting subcellular localization of multi-label proteins by incorporating the sequence features into Chou’s PseAAC. Genomics 2019, 111, 1325–1332.
3. Gardy, J.L.; Brinkman, F.S. Methods for predicting bacterial protein subcellular localization. Nat. Rev. Microbiol. 2006, 4, 741–751.
4. Yu, B.; Li, S.; Chen, C.; Xu, J.; Qiu, W.; Wu, X.; Chen, R. Prediction subcellular localization of Gram-negative bacterial proteins by support vector machine using wavelet denoising and Chou’s pseudo amino acid composition. Chemom. Intell. Lab. Syst. 2017, 167, 102–112.
5. Wang, S.; Yue, Y. Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm. PLoS ONE 2019, 13, e0195636.
6. Wang, S.; Liu, S. Protein sub-nuclear localization based on effective fusion representations and dimension reduction algorithm LDA. Int. J. Mol. Sci. 2015, 16, 30343–30361.
7. Nakashima, H.; Nishikawa, K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 1994, 238, 54–61.
8. Reinhardt, A. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 1998, 26, 2230–2236.
9. Chou, K.C.; Shen, H.B. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. J. Proteome Res. 2006, 5, 1888–1897.
10. Chou, K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001, 43, 246–255.
11. Chou, K.C. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 2000, 278, 477–483.
12. Hayat, M.; Iqbal, N. Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou’s general PseAAC and Support Vector Machine. Comput. Methods Programs Biomed. 2014, 116, 184–192.
13. Nanni, L.; Brahnam, S.; Lumini, A. Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition. J. Theor. Biol. 2014, 360, 109–116.
14. Zhou, X.B.; Chen, C.; Li, Z.C.; Zou, X.Y. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J. Theor. Biol. 2007, 248, 546–551.
15. Liu, L.; Cai, Y.; Lu, W.; Feng, K.; Peng, C.; Niu, B. Prediction of protein-protein interactions based on PseAA composition and hybrid feature selection. Biochem. Biophys. Res. Commun. 2009, 380, 318–322.
16. Li, B.; Cai, L.; Liao, B.; Fu, X.; Bing, P.; Yang, J. Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features. Molecules 2019, 24, 919.
17. Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L.; Chou, K.C. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43, 65–71.
18. Gribskov, M.; McLachlan, A.; Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4355–4358.
19. Shen, H.B.; Chou, K.C. Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng. Des. Sel. 2007, 20, 561–567.
20. Li, L.; Yu, S.; Xiao, W.; Li, Y.; Li, M.; Huang, L.; Zheng, X.; Zhou, S.; Yang, H. Prediction of bacterial protein subcellular localization by incorporating various features into Chou’s PseAAC and a backward feature selection approach. Biochimie 2014, 104, 100–107.
21. Yao, Y.; Xu, H.; He, P.; Dai, Q. Recent advances on prediction of protein subcellular localization. Mini-Rev. Org. Chem. 2015, 12, 481–492.
22. Chou, K.C.; Shen, H.B. Recent progress in protein subcellular location prediction. Anal. Biochem. 2007, 370, 1–16.
23. Armenteros, J.J.; Sonderby, C.K.; Sonderby, S.K.; Nielsen, H.; Winther, O. DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics 2017, 33, 3387–3395.
24. Chou, K.C.; Shen, H.B. Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochem. Biophys. Res. Commun. 2006, 347, 150–157.
25. Science, C.; Trust, W. Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning. Genes Genomes Genet. 2017, 7, 1385–1392.
26. Hasan, M.A.; Ahmad, S.; Molla, M.K. Protein subcellular localization prediction using multiple kernel learning based support vector machine. Mol. Biosyst. 2017, 13, 785–795.
27. Tu, Y.K.; Hong, Y.Y.; Chen, Y.C. Finite element modeling of kirschner pin and bone thermal contact during drilling. Life Sci. J. 2009, 6, 23–27.
28. Li, Y.; Li, L.P.; Wang, L.; Yu, C.Q.; Wang, Z.; You, Z.H. An Ensemble Classifier to Predict Protein–Protein Interactions by Combining PSSM-based Evolutionary Information with Local Binary Pattern Model. Int. J. Mol. Sci. 2019, 20, 3511.
29. Xiao, X.; Wu, Z.C.; Chou, K.C. iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol. 2011, 284, 42–51.
30. Chou, K.C.; Wu, Z.C.; Xiao, X. iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE 2011, 6, e18258.
31. Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; Mullers, K.R. Fisher discriminant analysis with kernels. IEEE Signal. Process. Soc. Workshop 1999, 41–48.
32. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
33. Gajowniczek, K.; Grzegorczyk, I.; Ząbkowski, T.; Bajaj, C. Weighted Random Forests to Improve Arrhythmia Classification. Electronics 2020, 9, 99.
34. Kumar, R.; Jain, S.; Kumari, B.; Kumar, M. Protein sub-nuclear localization prediction using SVM and Pfam domain information. PLoS ONE 2014, 9, e98345.
35. Chou, K.C.; Liu, W.M.; Maggiora, G.M.; Zhang, C.T. Prediction and classification of domain structural classes. Proteins Struct. Funct. Genet. 1998, 31, 97–103.
36. Cheng, X.; Zhao, S.G.; Lin, W.Z.; Xiao, X.; Chou, K.C. PLoc-mAnimal: Predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 2017, 33, 3524–3531.
37. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Adv. Intell. Comput. 2005, 3644, 878–887.
38. Farquad, M.A.H.; Bose, I. Preprocessing unbalanced data using support vector machine. Decis. Support. Syst. 2012, 53, 226–233.
39. William, A.R. Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets. Inf. Sci. 2017, 408, 146–161.
40. Yue, Y.; Wang, S. Protein subnuclear location based on KLDA with fused kernel and effective fusion representation. In Proceedings of the 6th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 21–22 October 2017.
41. Song, C. Protein Subnuclear Localization Using a Hybrid Classifier Combined with Chou’s Pseudo Amino Acid Composition. In Proceedings of the 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 13–15 October 2018.
Class | Subnuclear Localization Name | Number |
---|---|---|
1 | Chromatin proteins (Ca) | 99 |
2 | Heterochromatin proteins (Ht) | 22 |
3 | Nuclear envelope proteins (Ne) | 61 |
4 | Nuclear matrix proteins (Nm) | 29 |
5 | Nuclear pore complex proteins (Nc) | 79 |
6 | Nuclear speckle proteins (Ns) | 67 |
7 | Nucleolus proteins (Nl) | 307 |
8 | Nucleoplasm proteins (Np) | 37 |
9 | Nuclear PML body proteins (Nb) | 13 |
Sum | | 714 |
Class | Subnuclear Localization Name | Number |
---|---|---|
1 | Centromere proteins (Cn) | 86 |
2 | Chromosome proteins (Co) | 113 |
3 | Nuclear speckle proteins (Ns) | 50 |
4 | Nucleolus proteins (Nl) | 294 |
5 | Nuclear envelope proteins (Ne) | 17 |
6 | Nuclear matrix proteins (Nm) | 18 |
7 | Nucleoplasm proteins (Np) | 30 |
8 | Nuclear pore complex proteins (Nc) | 12 |
9 | Nuclear proteins (Na) | 12 |
10 | PML body Telomere (Pb) | 37 |
Sum | | 669 |
Dataset 1 | Index | Original Data (KLDA + RF) | SMOTE (KLDA + RF) | Radius-SMOTE (KLDA + RF) |
---|---|---|---|---|
Ca | Se | 0.546 | 0.815 | 0.953 |
Ca | Sp | 0.890 | 0.985 | 0.988 |
Ca | ACC | 0.832 | 0.964 | 0.984 |
Ca | MCC | 0.423 | 0.830 | 0.925 |
Ht | Se | 0.455 | 0.980 | 0.984 |
Ht | Sp | 0.996 | 0.973 | 0.999 |
Ht | ACC | 0.972 | 0.975 | 0.997 |
Ht | MCC | 0.604 | 0.897 | 0.985 |
Ne | Se | 0.508 | 0.882 | 0.980 |
Ne | Sp | 0.940 | 0.988 | 0.992 |
Ne | ACC | 0.891 | 0.974 | 0.991 |
Ne | MCC | 0.451 | 0.883 | 0.958 |
Nm | Se | 0.31 | 0.921 | 0.979 |
Nm | Sp | 0.986 | 0.994 | 0.996 |
Nm | ACC | 0.947 | 0.985 | 0.994 |
Nm | MCC | 0.393 | 0.928 | 0.971 |
Nc | Se | 0.519 | 0.798 | 0.924 |
Nc | Sp | 0.919 | 0.992 | 0.998 |
Nc | ACC | 0.863 | 0.972 | 0.991 |
Nc | MCC | 0.436 | 0.839 | 0.946 |
Ns | Se | 0.388 | 0.873 | 0.974 |
Ns | Sp | 0.927 | 0.996 | 0.993 |
Ns | ACC | 0.863 | 0.982 | 0.991 |
Ns | MCC | 0.326 | 0.907 | 0.955 |
Nl | Se | 0.958 | 0.886 | 0.876 |
Nl | Sp | 0.766 | 0.949 | 0.994 |
Nl | ACC | 0.872 | 0.941 | 0.98 |
Nl | MCC | 0.747 | 0.761 | 0.901 |
Np | Se | 0.432 | 0.865 | 0.983 |
Np | Sp | 0.985 | 0.994 | 0.997 |
Np | ACC | 0.945 | 0.978 | 0.995 |
Np | MCC | 0.522 | 0.897 | 0.977 |
Nb | Se | 0.154 | 0.983 | 1.000 |
Nb | Sp | 1.000 | 0.996 | 0.999 |
Nb | ACC | 0.978 | 0.994 | 0.999 |
Nb | MCC | 0.388 | 0.975 | 0.994 |
Dataset 2 | Index | Original Data (KLDA + RF) | SMOTE (KLDA + RF) | Radius-SMOTE (KLDA + RF) |
---|---|---|---|---|
Cn | Se | 0.698 | 0.849 | 0.969 |
Cn | Sp | 0.858 | 0.993 | 0.991 |
Cn | ACC | 0.834 | 0.978 | 0.989 |
Cn | MCC | 0.479 | 0.878 | 0.939 |
Co | Se | 0.726 | 0.774 | 0.854 |
Co | Sp | 0.846 | 0.992 | 0.992 |
Co | ACC | 0.822 | 0.973 | 0.980 |
Co | MCC | 0.515 | 0.821 | 0.869 |
Ns | Se | 0.320 | 0.900 | 0.988 |
Ns | Sp | 0.953 | 0.993 | 0.994 |
Ns | ACC | 0.893 | 0.984 | 0.994 |
Ns | MCC | 0.310 | 0.908 | 0.963 |
Nl | Se | 0.932 | 0.881 | 0.854 |
Nl | Sp | 0.817 | 0.948 | 0.995 |
Nl | ACC | 0.881 | 0.940 | 0.979 |
Nl | MCC | 0.759 | 0.742 | 0.890 |
Ne | Se | 0.118 | 0.965 | 0.983 |
Ne | Sp | 0.998 | 0.997 | 0.999 |
Ne | ACC | 0.967 | 0.993 | 0.998 |
Ne | MCC | 0.271 | 0.967 | 0.988 |
Nm | Se | 0.111 | 0.962 | 0.983 |
Nm | Sp | 0.992 | 0.976 | 0.996 |
Nm | ACC | 0.959 | 0.992 | 0.995 |
Nm | MCC | 0.175 | 0.961 | 0.975 |
Np | Se | 0.233 | 0.926 | 0.993 |
Np | Sp | 0.979 | 0.990 | 0.997 |
Np | ACC | 0.934 | 0.983 | 0.996 |
Np | MCC | 0.278 | 0.911 | 0.980 |
Nc | Se | 0.333 | 0.962 | 1.000 |
Nc | Sp | 0.996 | 0.997 | 0.999 |
Nc | ACC | 0.979 | 0.993 | 0.999 |
Nc | MCC | 0.462 | 0.965 | 0.994 |
Na | Se | 0.167 | 0.951 | 1.000 |
Na | Sp | 0.998 | 0.998 | 0.999 |
Na | ACC | 0.977 | 0.993 | 0.999 |
Na | MCC | 0.326 | 0.964 | 0.994 |
Pb | Se | 0.460 | 0.915 | 0.996 |
Pb | Sp | 0.980 | 0.993 | 0.996 |
Pb | ACC | 0.941 | 0.985 | 0.996 |
Pb | MCC | 0.519 | 0.920 | 0.979 |
Oversampling Methods | Overall Accuracy (%) |
---|---|
Borderline_SMOTE1 [37] | 92.4 |
Borderline_SMOTE2 [37] | 93.3 |
SVM_balance [38] | 93.1 |
NRAS [39] | 89.3 |
The proposed Radius-SMOTE | 96.1 |
Methods (Jackknife Test) | Overall Accuracy (%) |
---|---|
Fusion of PsePSSM and PseAAC-KNN [19] | 67.4 |
PseAAPSSM-LDA-KNN [6] | 88.1 |
DipPSSM-LDA-KNN [6] | 95.9 |
AACPSSM with fused kernel-KLDA-KNN [40] | 94.7 |
PseAAC-A hybrid-classifier-based SVM [41] | 81.2 |
CoPSSM-KLDA-based DGGA-KNN [5] | 90.3 |
The proposed PSSM-Radius-SMOTE-KLDA-RF | 96.1 |
Oversampling Methods | Overall Accuracy (%) |
---|---|
Borderline_SMOTE1 [37] | 93.8 |
Borderline_SMOTE2 [37] | 93.6 |
SVM_balance [38] | 95.6 |
NRAS [39] | 88.1 |
The proposed Radius-SMOTE | 95.7 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Wu, L.; Huang, S.; Wu, F.; Jiang, Q.; Yao, S.; Jin, X. Protein Subnuclear Localization Based on Radius-SMOTE and Kernel Linear Discriminant Analysis Combined with Random Forest. Electronics 2020, 9, 1566. https://doi.org/10.3390/electronics9101566