Support Vector Machine Classifier for Accurate Identification of piRNA
Abstract
1. Introduction
2. Methods
2.1. Datasets
2.2. Sequence Information
2.2.1. Pse-Nucleotide Composition
2.2.2. Split-Position-Specific Matrix
2.2.3. Six RNA Dimer’s Physicochemical Properties
2.2.4. Feature Optimization
2.3. SVM Implementation and Parameter Selection
2.4. Model Construction and Evaluation
3. Results and Discussion
Algorithm 1: The Predictive Algorithm for piRNAPred.
Input: a set of samples, with c categories.
Output: the predicted label of each sample.
For each sample in the set do
    Take that sample as the test sample and all other samples as the training dataset.
    Extract features.
    Predict the category.
end
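
The loop in Algorithm 1 is a jackknife (leave-one-out) procedure, and it can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `extract_features` is a hypothetical dinucleotide-composition extractor, and `train_centroids`/`predict` form a nearest-centroid stand-in for the SVM, so none of these names or choices come from the article itself.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGU"
DIMERS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def extract_features(seq):
    """Hypothetical extractor: normalized dinucleotide composition (16 values)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIMERS]


def train_centroids(features, labels):
    """Stand-in classifier (assumption): one mean vector per class.

    In the article's setting an SVM would be trained at this point.
    """
    sums = {}
    sizes = Counter(labels)
    for vec, lab in zip(features, labels):
        running = sums.setdefault(lab, [0.0] * len(vec))
        for j, value in enumerate(vec):
            running[j] += value
    return {lab: [v / sizes[lab] for v in vec] for lab, vec in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def jackknife_predict(sequences, labels):
    """Leave-one-out loop: each sample is the test sample exactly once."""
    predictions = []
    for i, test_seq in enumerate(sequences):
        # All other samples form the training dataset.
        train_seqs = sequences[:i] + sequences[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Features are extracted inside the loop, as in Algorithm 1, so that
        # extractors that depend on the training set never see the test sample.
        train_x = [extract_features(s) for s in train_seqs]
        model = train_centroids(train_x, train_labels)
        predictions.append(predict(model, extract_features(test_seq)))
    return predictions
```

Calling `jackknife_predict(sequences, labels)` with two Python lists of equal length returns one predicted label per sample, which can then be compared with the true labels to obtain Sn, Sp, Acc and MCC.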
3.1. Window Size Optimization for Bi-Profile Bayes
3.2. Feature Selection for the First Layer
3.3. Feature Selection for the Second Layer
3.4. Performance of 2L-piRNAPred
3.5. Comparison for Different Classifiers
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Acknowledgments
Conflicts of Interest

Physicochemical properties of the RNA dimers

Dimer | Rise | Roll | Shift | Slide | Tilt | Twist |
---|---|---|---|---|---|---|
AA | −0.862 | −0.689 | −1.163 | 1.386 | −1.896 | −0.27 |
AC | −0.149 | −1.698 | 1.545 | 0.51 | 0.555 | 0.347 |
AG | 0.565 | 0 | −0.813 | 0.127 | 0.096 | −0.888 |
AU | −0.149 | −0.643 | −0.988 | 0.894 | 1.015 | 0.965 |
CA | −1.931 | 0.643 | 0.497 | 0.346 | 0.862 | −0.27 |
CC | 0.802 | 0.092 | −0.551 | −0.1407 | −0.211 | 0.347 |
CG | 0.565 | 1.652 | 2.156 | −2.009 | −0.823 | −2.741 |
CU | 0.565 | 0 | −0.813 | 0.127 | 0.096 | −0.888 |
GA | 1.515 | 0.413 | 0.147 | −0.969 | 1.321 | 0.347 |
GC | −0.386 | −1.102 | 0.147 | 0.729 | −0.67 | 2.201 |
GG | 0.802 | 1.652 | −0.551 | −1.407 | −0.211 | 0.347 |
GU | −0.149 | −1.698 | 1.545 | 0.51 | 0.555 | 0.347 |
UA | 0.089 | 1.01 | −0.639 | 0.401 | −0.977 | 0.347 |
UC | 1.515 | 0.413 | 0.147 | −0.969 | 1.321 | 0.347 |
UG | −1.931 | 0.643 | 0.497 | 0.346 | 0.862 | −0.27 |
UU | −0.862 | −0.689 | −1.163 | 1.386 | −1.896 | −0.27 |

Cross-Validation | Features | Dimension | Sn (%) | Sp (%) | Acc (%) | MCC |
---|---|---|---|---|---|---|
5-fold | BPB | 15 | 88.4 | 77.1 | 82.8 | 0.660 |
5-fold | SNC | 4 | 45.4 | 80.62 | 63.0 | 0.279 |
5-fold | DNC | 8 | 86.6 | 82.1 | 84.3 | 0.687 |
5-fold | TNC | 56 | 85.3 | 82.6 | 83.9 | 0.678 |
5-fold | PP | 84 | 84.5 | 79.1 | 81.8 | 0.636 |
5-fold | BPB + SNC + DNC + TNC + PP | - | 90.4 | 87.5 | 89.0 | 0.779 |
Jackknife | BPB + SNC + DNC + TNC + PP | - | 90.4 | 87.9 | 89.2 | 0.784 |
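
The four measures reported in these tables (Sn, Sp, Acc and MCC) follow the standard confusion-matrix definitions, which the short sketch below spells out. The function name and the example counts are illustrative assumptions, not values taken from the article's datasets.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn)                     # fraction of positives recovered
    sp = tn / (tn + fp)                     # fraction of negatives recovered
    acc = (tp + tn) / (tp + fn + tn + fp)   # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; 0.0 is used by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For a hypothetical balanced set with TP = 904, FN = 96, TN = 875 and FP = 125, the function gives Sn = 0.904, Sp = 0.875, Acc ≈ 0.890 and MCC ≈ 0.779. Unlike Acc, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.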

Cross-Validation | Features | Dimension | Sn (%) | Sp (%) | Acc (%) | MCC |
---|---|---|---|---|---|---|
5-fold | BPB | 15 | 73.0 | 73.0 | 72.9 | 0.459 |
5-fold | SNC | 4 | 69.2 | 73.6 | 71.4 | 0.427 |
5-fold | DNC | 10 | 75.2 | 73.6 | 74.3 | 0.488 |
5-fold | TNC | 48 | 77.8 | 77.6 | 77.7 | 0.554 |
5-fold | PP | 24 | 74.7 | 74.2 | 74.3 | 0.489 |
5-fold | BPB + SNC + DNC + TNC + PP | 101 | 80.0 | 77.3 | 78.7 | 0.573 |
5-fold | BPB + SNC + DNC + TNC + PP + KNN (N-terminal) | 111 | 80.1 | 79.3 | 79.8 | 0.598 |
5-fold | BPB + SNC + DNC + PP + KNN (N- and C-terminals) | 121 | 84.3 | 83.6 | 84.0 | 0.68 |
Jackknife | BPB + SNC + DNC + PP + KNN (N- and C-terminals) | 121 | 85.1 | 83.2 | 84.1 | 0.683 |

Layer | Methodology | Sn (%) | Sp (%) | Acc (%) | MCC |
---|---|---|---|---|---|
First | piRNAPred | 90.4 | 87.5 | 89 | 0.779 |
First | 2L-piRNA | 88.3 | 83.9 | 86.1 | 0.723 |
First | Accurate piRNA prediction | 83.1 | 82.1 | 82.6 | 0.651 |
First | GA-WE | 90.6 | 78.3 | 84.4 | 0.694 |
Second | piRNAPred | 84.3 | 83.6 | 84 | 0.68 |
Second | 2L-piRNA | 79.1 | 76 | 77.6 | 0.552 |

Layer | Method | Sn (%) | Sp (%) | Acc (%) | MCC |
---|---|---|---|---|---|
First | SVM | 90.4 | 87.5 | 89 | 0.779 |
First | RF | 85.8 | 88.4 | 87.1 | 0.743 |
First | KNN | 88.7 | 83.6 | 86.1 | 0.724 |
First | Ensemble | 89.9 | 87.0 | 88.5 | 0.770 |
Second | SVM | 84.3 | 83.6 | 84.0 | 0.680 |
Second | RF | 72.9 | 72.8 | 72.9 | 0.457 |
Second | KNN | 73.3 | 69.7 | 71.5 | 0.431 |
Second | Ensemble | 75.9 | 73.6 | 74.8 | 0.495 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Li, T.; Gao, M.; Song, R.; Yin, Q.; Chen, Y. Support Vector Machine Classifier for Accurate Identification of piRNA. Appl. Sci. 2018, 8, 2204. https://doi.org/10.3390/app8112204