In-Pero: Exploiting Deep Learning Embeddings of Protein Sequences to Predict the Localisation of Peroxisomal Proteins
Abstract
1. Introduction
2. Results
2.1. Selection of the Best Classifier for Sub-Peroxisomal Prediction
2.2. In-Pero: A Tool for the Prediction of Peroxisomal Protein Sub-Localisation
- Input of the protein sequence in FASTA format.
- Calculation of the statistical representation of the protein sequence using the UniRep [12] and SeqVec [13] embeddings.
- Merging of the two statistical representations to obtain a 2924-dimensional representation of the protein sequence.
- Prediction of the subcellular localization using the trained SVM prediction model.
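The four steps above can be sketched in code. This is a minimal illustration, not the released tool: `unirep_embed` and `seqvec_embed` are hypothetical stand-ins (seeded random vectors of the correct lengths) for the real embedders from the UniRep and SeqVec repositories, and the training sequences and labels are toy data.

```python
import numpy as np
from sklearn.svm import SVC

def _seed(sequence: str) -> int:
    # Deterministic seed so the stand-in embeddings are reproducible.
    return sum(ord(c) for c in sequence)

def unirep_embed(sequence: str) -> np.ndarray:
    # Stand-in for the 1900-dimensional UniRep embedding.
    return np.random.default_rng(_seed(sequence)).normal(size=1900)

def seqvec_embed(sequence: str) -> np.ndarray:
    # Stand-in for the 1024-dimensional per-protein SeqVec embedding.
    return np.random.default_rng(_seed(sequence) + 1).normal(size=1024)

def featurise(sequence: str) -> np.ndarray:
    # Merge the two embeddings into one 2924-dimensional vector.
    return np.concatenate([unirep_embed(sequence), seqvec_embed(sequence)])

# Toy training set standing in for the curated membrane/matrix data.
sequences = ["MKTAYIAKQR", "MLSRAVCGTS", "MAGWNAYIDN", "MTEYKLVVVG"]
labels = [0, 1, 0, 1]  # 0 = membrane, 1 = matrix

X = np.stack([featurise(s) for s in sequences])
clf = SVC(kernel="linear").fit(X, labels)

# Predict the sub-peroxisomal localisation of a new sequence.
query = featurise("MKWVTFISLL").reshape(1, -1)
print(X.shape, int(clf.predict(query)[0]))
```

The only structural claims taken from the text are the embedding lengths (1900 + 1024 = 2924) and the use of a trained SVM as the final classifier.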
2.3. Validation of Sub-Peroxisomal Membrane Protein Prediction
2.4. Extending In-Pero to Predict Sub-Mitochondrial Proteins
3. Discussion
4. Materials and Methods
4.1. Overview of the Full Comparison Workflow
- Data curation: Retrieval of peroxisomal protein sequences from UniProt, clustering and filtering.
- Feature extraction: Transformation of the protein sequences into numerical representations capturing protein characteristics using classical encodings (1HOT, PROP and PSSM) and deep-learning embeddings (UniRep and SeqVec).
- Full comparison: Double cross-validated assessment of the prediction capability of different combinations of machine-learning approaches (Logistic Regression, Support Vector Machines, Partial Least Squares Discriminant Analysis and Random Forest) and protein sequence encodings and embeddings, using Step Forward Feature Selection.
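The double (nested) cross-validation in the last step can be sketched with scikit-learn, which the study uses as its machine-learning library: an inner loop tunes hyperparameters while an outer loop gives an unbiased performance estimate. The toy data and the specific `C` grid below are illustrative assumptions, not the paper's actual grids.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy data standing in for an embedding matrix (samples x features).
X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# Inner loop: hyperparameter selection; outer loop: performance estimation.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner,
                    scoring="balanced_accuracy")
scores = cross_val_score(grid, X, y, cv=outer, scoring="balanced_accuracy")
print(len(scores), round(scores.mean(), 3))
```

Because the tuning happens inside each outer fold, the outer scores are not biased by model selection, which is the point of the double cross-validation design [93,94].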
4.2. Data Sets
4.2.1. Retrieval of Peroxisomal Membrane Proteins
4.2.2. Retrieval of Peroxisomal Matrix Proteins
4.2.3. Retrieval of Candidate Peroxisomal Proteins
4.2.4. Data Sets for Sub-Mitochondrial Protein Classification
4.3. Classic Protein Sequence Encoding Methods
- Residue one-hot encoding. The one-hot encoding (1-HOT) [19] is the most widely used binary encoding method. A residue j is represented by a 20-dimensional vector containing 0s except in the j-th position; for instance, alanine (A) is represented as (1, 0, 0, …, 0), a single 1 followed by nineteen 0s. A protein sequence of L amino acids is thus represented by an L × 20 matrix.
- Residue physico-chemical properties encoding. Kidera et al. devised a way to represent an amino acid with ten factors [20] summarising different amino-acid physico-chemical properties. This encoding method, often abbreviated as PROP, is the most commonly used physico-chemical encoding [19]. Any given residue j in the protein sequence is represented by a vector of ten real numbers. Each number summarises different amino-acid properties and is an orthogonal factor obtained from multivariate statistical analysis of a starting set of 188 residue-specific physical properties. A protein sequence of L amino acids is thus represented by an L × 10 matrix.
- Position-specific scoring matrix. The position-specific scoring matrix (PSSM) [21,22] takes into account the evolutionary information of a protein. This scoring matrix is at the basis of protein BLAST searches (BLAST and PSI-BLAST) [37], where residues are translated into substitution scores. A residue j in the protein sequence is represented by a vector of 20 position-specific substitution scores. Amino-acid substitution scores are given separately for each position of the protein multiple sequence alignment (MSA) after running PSI-BLAST [37] against the UniRef90 data set (release October 2019) for three iterations with an e-value threshold of 0.001. We used a sigmoid function to map the values extracted from the PSI-BLAST checkpoint file into the range [0, 1], as in DeepMito [15]. In essence, the PSSM captures the conservation pattern in the alignment and summarises the evolutionary information of the protein. A protein sequence of L amino acids is thus represented by an L × 20 matrix.
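As a minimal illustration of the first of these encodings, the sketch below builds the L × 20 one-hot matrix described above (PROP and PSSM would analogously yield L × 10 and L × 20 real-valued matrices; the alphabet ordering is an assumption, since any fixed ordering of the 20 residues works):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one fixed ordering of the 20 residues

def one_hot(sequence: str) -> np.ndarray:
    """Encode a protein of length L as an L x 20 binary matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = np.zeros((len(sequence), 20), dtype=int)
    for pos, aa in enumerate(sequence):
        matrix[pos, index[aa]] = 1
    return matrix

enc = one_hot("ARN")       # alanine, arginine, asparagine
print(enc.shape)           # (3, 20)
print(enc[0].tolist())     # alanine: a 1 followed by nineteen 0s
```

Each row contains exactly one 1, so the matrix has L non-zero entries in total.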
4.4. Deep Learning Protein Sequence Embeddings
- Unified Representation. The Unified Representation (UniRep) [12] is based on a recurrent neural network architecture (1900 hidden units) able to capture chemical, biological and evolutionary information encoded in the protein sequence, trained on ∼24 million UniRef50 sequences [38]. Technically, the protein sequence is modelled using a hidden state vector, which is recursively updated from the previous hidden state vector: the method learns by scanning a sequence of amino acids and predicting the next one from the sequence seen so far. Using UniRep, a protein sequence can be represented by an embedding of length 64, 256 or 1900 units, depending on the neural network architecture used. In this study, we used the 1900-unit representation (the averaged final hidden state). For a detailed explanation of how to retrieve the UniRep embedding, we refer the reader to the dedicated GitHub repository: https://github.com/churchlab/UniRep (accessed on 6 June 2021).
- Sequence-to-Vector embedding. The Sequence-to-Vector embedding (SeqVec) [13] embeds the biophysical information of a protein sequence by taking a natural-language-processing approach, considering amino acids as words and proteins as sentences. SeqVec is obtained by training ELMo [39], a deep contextualised word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts; it consists of a two-layer bidirectional LSTM [40] backbone pre-trained on a large text corpus, in this case UniRef50 [38]. The SeqVec embedding can be obtained at the per-residue (word) level or the per-protein (sentence) level. The per-residue embedding can be used to predict secondary structure or intrinsically disordered regions; the per-protein embedding can be used to predict subcellular localisation and to distinguish membrane-bound from water-soluble proteins [13]. Here we use the per-protein representation, in which the protein sequence is represented by an embedding of length 1024. For a detailed explanation of how to retrieve the SeqVec embedding, we refer the reader to the dedicated GitHub repository: https://github.com/mheinzinger/SeqVec (accessed on 6 June 2021).
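The relation between the per-residue and per-protein levels can be sketched as a pooling step: a variable-length L × 1024 residue matrix is averaged over residues into one fixed-length 1024-dimensional vector. The random matrix below is only a stand-in for SeqVec's real per-residue output, and mean pooling is one common choice for this reduction.

```python
import numpy as np

# Stand-in for a per-residue SeqVec embedding of a 120-residue protein:
# one 1024-dimensional vector per residue (L x 1024).
rng = np.random.default_rng(0)
L, d = 120, 1024
per_residue = rng.normal(size=(L, d))

# Per-protein (sentence-level) representation: average over residues,
# giving a fixed-length vector regardless of protein length L.
per_protein = per_residue.mean(axis=0)
print(per_protein.shape)  # (1024,)
```

The fixed length is what makes the per-protein embedding directly usable as input to the classifiers compared in this study.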
4.5. Step Forward Feature Selection
4.6. Classification Algorithms
- Partial least squares discriminant analysis (PLS-DA) is a partial least squares regression [46,47] in which the response vector Y contains dummy variables indicating class labels (0–1 in this case). Samples with a predicted response above 0.5 are classified as belonging to class 1, and to class 0 otherwise. PLS finds combinations of the original variables maximising the covariance between the predictor variables and the response Y by projecting the data into a k-dimensional space, with k possibly much smaller than the original number of variables.
- Logistic Regression (LR). We used a penalised implementation of multivariable logistic regression [48].
4.7. Model Calibration and Validation
4.8. Metrics for Model Classification Accuracy
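The three metrics reported in the results tables (ACC, BACC, MCC) are all available in scikit-learn; the small imbalanced example below (an assumption for illustration, not data from the study) shows why BACC and MCC are informative when one class dominates: plain accuracy looks good while MCC reveals a weak classifier.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             matthews_corrcoef)

# Imbalanced toy labels: six negatives, two positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(round(accuracy_score(y_true, y_pred), 3))           # 0.75
print(round(balanced_accuracy_score(y_true, y_pred), 3))  # 0.667
print(round(matthews_corrcoef(y_true, y_pred), 3))        # 0.333
```

BACC averages the per-class recalls (5/6 and 1/2 here), and MCC [97,98] accounts for all four confusion-matrix cells, which is why both drop well below the raw accuracy on this example.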
4.9. Prediction of Trans-Membrane Proteins
4.10. Software
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hartmann, T.; Bergsdorf, C.; Sandbrink, R.; Tienari, P.J.; Multhaup, G.; Ida, N.; Bieger, S.; Dyrks, T.; Weidemann, A.; Masters, C.L.; et al. Alzheimer’s disease βA4 protein release and amyloid precursor protein sorting are regulated by alternative splicing. J. Biol. Chem. 1996, 271, 13208–13214.
- Shurety, W.; Merino-Trigo, A.; Brown, D.; Hume, D.A.; Stow, J.L. Localization and post-Golgi trafficking of tumor necrosis factor-alpha in macrophages. J. Interferon Cytokine Res. 2000, 20, 427–438.
- Bryant, D.M.; Stow, J.L. The ins and outs of E-cadherin trafficking. Trends Cell Biol. 2004, 14, 427–434.
- Andrade, M.A.; O’Donoghue, S.I.; Rost, B. Adaptation of protein surfaces to subcellular location. J. Mol. Biol. 1998, 276, 517–525.
- Nakashima, H.; Nishikawa, K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 1994, 238, 54–61.
- Dönnes, P.; Höglund, A. Predicting protein subcellular localization: Past, present, and future. Genom. Proteom. Bioinform. 2004, 2, 209–215.
- Pierleoni, A.; Martelli, P.L.; Fariselli, P. BaCelLo: A Balanced subCellular Localization predictor. Bioinformatics 2006, 22, e408–e416.
- Käll, L.; Krogh, A.; Sonnhammer, E.L. A Combined Transmembrane Topology and Signal Peptide Prediction Method. J. Mol. Biol. 2004, 338, 1027–1036.
- Horton, P.; Park, K.J.; Obayashi, T.; Fujita, N.; Harada, H.; Adams-Collier, C.; Nakai, K. WoLF PSORT: Protein localization predictor. Nucleic Acids Res. 2007, 35, W585–W587.
- Savojardo, C.; Martelli, P.L.; Fariselli, P.; Casadio, R. TPpred3 detects and discriminates mitochondrial and chloroplastic targeting peptides in eukaryotic proteins. Bioinformatics 2015, 31, 3269–3275.
- Jiang, Y.; Wang, D.; Yao, Y.; Eubel, H.; Künzler, P.; Møller, I.; Xu, D. MULocDeep: A Deep-Learning Framework for Protein Subcellular and Suborganellar Localization Prediction with Residue-Level Interpretation. 2020.
- Alley, E.; Khimulya, G.; Biswas, S.; Alquraishi, M.; Church, G. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 2019, 16.
- Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 2019, 20, 1–17.
- Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. bioRxiv 2020.
- Savojardo, C.; Bruciaferri, N.; Tartari, G.; Martelli, P.L.; Casadio, R. DeepMito: Accurate prediction of protein sub-mitochondrial localization using convolutional neural networks. Bioinformatics 2019, 36, 56–64.
- Almagro Armenteros, J.J.; Sønderby, C.K.; Sønderby, S.K.; Nielsen, H.; Winther, O. DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics 2017, 33, 3387–3395.
- Ho Thanh Lam, L.; Le, N.H.; Van Tuan, L.; Tran Ban, H.; Nguyen Khanh Hung, T.; Nguyen, N.T.K.; Huu Dang, L.; Le, N.Q.K. Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences. Biology 2020, 9, 325.
- Le, N.Q.K.; Huynh, T.T. Identifying SNAREs by Incorporating Deep Learning Architecture and Amino Acid Embedding Representation. Front. Physiol. 2019, 10, 1501.
- Jing, X.; Dong, Q.; Hong, D.; Lu, R. Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 17, 1918–1931.
- Kidera, A.; Konishi, Y.; Oka, M.; Ooi, T.; Scheraga, H. Statistical Analysis of the Physical Properties of the 20 Naturally Occurring Amino Acids. J. Protein Chem. 1985, 4, 23–55.
- Attwood, T. Profile (Position-Specific Scoring Matrix, Position Weight Matrix, PSSM, Weight Matrix). In Dictionary of Bioinformatics and Computational Biology; John Wiley & Sons: Hoboken, NJ, USA, 2004.
- Stormo, G.D.; Schneider, T.D.; Gold, L.; Ehrenfeucht, A. Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982, 10, 2997–3011.
- Wanders, R.J.A.; Waterham, H.R.; Ferdinandusse, S. Metabolic Interplay between Peroxisomes and Other Subcellular Organelles Including Mitochondria and the Endoplasmic Reticulum. Front. Cell Dev. Biol. 2016, 3, 83.
- Islinger, M.; Voelkl, A.; Fahimi, H.; Schrader, M. The peroxisome: An update on mysteries 2.0. Histochem. Cell Biol. 2018, 150, 1–29.
- Islinger, M.; Grille, S.; Fahimi, H.D.; Schrader, M. The peroxisome: An update on mysteries. Histochem. Cell Biol. 2012, 137, 547–574.
- Farré, J.C.; Mahalingam, S.S.; Proietto, M.; Subramani, S. Peroxisome biogenesis, membrane contact sites, and quality control. EMBO Rep. 2019, 20, e46864.
- Baker, A.; Carrier, D.J.; Schaedler, T.; Waterham, H.; van Roermund, C.; Theodoulou, F. Peroxisomal ABC transporters: Functions and mechanism. Biochem. Soc. Trans. 2015, 43, 959–965.
- Schlüter, A.; Real-Chicharro, A.; Gabaldón, T.; Sánchez-Jiménez, F.; Pujol, A. PeroxisomeDB 2.0: An integrative view of the global peroxisomal metabolome. Nucleic Acids Res. 2009, 38, D800–D805.
- Lipka, V.; Dittgen, J.; Bednarek, P.; Bhat, R.; Wiermer, M.; Stein, M.; Landtag, J.; Brandt, W.; Rosahl, S.; Scheel, D.; et al. Pre- and Postinvasion Defenses Both Contribute to Nonhost Resistance in Arabidopsis. Science 2005, 310, 1180–1183.
- Siddiqui, S.S.; Springer, S.A.; Verhagen, A.; Sundaramurthy, V.; Alisson-Silva, F.; Jiang, W.; Ghosh, P.; Varki, A. The Alzheimer’s disease–protective CD33 splice variant mediates adaptive loss of function via diversion to an intracellular pool. J. Biol. Chem. 2017, 292, 15312–15320.
- Schapira, A.H. Mitochondrial disease. Lancet 2006, 368, 70–82.
- Kumar, R.; Kumari, B.; Kumar, M. Proteome-wide prediction and annotation of mitochondrial and sub-mitochondrial proteins by incorporating domain information. Mitochondrion 2018, 42, 11–22.
- Wang, X.; Jin, Y.; Zhang, Q. DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment. Int. J. Mol. Sci. 2020, 21, 5710.
- Savojardo, C.; Martelli, P.L.; Tartari, G.; Casadio, R. Large-scale prediction and analysis of protein sub-mitochondrial localization with DeepMito. BMC Bioinform. 2020, 21.
- Morgat, A.; Lombardot, T.; Coudert, E.; Axelsen, K.; Neto, T.B.; Gehant, S.; Bansal, P.; Bolleman, J.; Gasteiger, E.; de Castro, E.; et al. Enzyme annotation in UniProtKB using Rhea. Bioinformatics 2019, 36, 1896–1901.
- Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
- Altschul, S.; Madden, T.; Schäffer, A.; Zhang, J.; Zhang, Z. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
- Suzek, B.E.; Huang, H.; McGarvey, P.; Mazumder, R.; Wu, C.H. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007, 23, 1282–1288.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Meyer-Baese, A.; Schmid, V. Chapter 2: Feature selection and extraction. In Pattern Recognition and Signal Analysis in Medical Imaging; Academic Press: San Diego, CA, USA, 2014; pp. 21–69.
- Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers; Association for Computing Machinery: New York, NY, USA, 1992; pp. 144–152.
- Cristianini, N.; Ricci, E. Support Vector Machines. In Encyclopedia of Algorithms; Springer: Boston, MA, USA, 2008; pp. 928–932.
- Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Wold, H. Path Models with Latent Variables: The NIPALS (Nonlinear Iterative Partial Least Squares) Approach. In Quantitative Sociology; Blalock, H., Aganbegian, A., Borodkin, F., Boudon, R., Capecchi, V., Eds.; International Perspectives on Mathematical and Statistical Modeling; Academic Press: Cambridge, MA, USA, 1975; pp. 307–357.
- Wold, S.; Ruhe, A.; Wold, H.; Dunn, W., III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 1984, 5, 735–743.
- Cramer, J. The Origins of Logistic Regression. Tinbergen Institute Discussion Paper 2002.
- Cawley, G.C.; Talbot, N.L.C. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107.
- Filzmoser, P.; Liebmann, B.; Varmuza, K. Repeated double cross validation. J. Chemom. 2009, 23, 160–171.
- Rijsbergen, C.J.V. Information Retrieval, 2nd ed.; Butterworth-Heinemann: Newton, MA, USA, 1979.
- Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The balanced accuracy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124.
- Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta Protein Struct. 1975, 405, 442–451.
- Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 2017, 12, e0177678.
- Sonnhammer, E.L.; Von Heijne, G.; Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. ISMB 1998, 6, 175–182.
- Krogh, A.; Larsson, B.; Von Heijne, G.; Sonnhammer, E.L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305, 567–580.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
(a) LR

| Encoding | (inner) | (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.577 | 0.623 ± 0.071 | 0.618 ± 0.075 | 0.269 ± 0.143 | 0.809 ± 0.036 |
| PROP | 0.607 | 0.595 ± 0.109 | 0.591 ± 0.093 | 0.213 ± 0.222 | 0.794 ± 0.054 |
| PSSM | 0.615 | 0.575 ± 0.067 | 0.604 ± 0.089 | 0.177 ± 0.144 | 0.719 ± 0.040 |
| UniRep | 0.765 | 0.749 ± 0.068 | 0.755 ± 0.077 | 0.501 ± 0.137 | 0.856 ± 0.032 |
| SeqVec | 0.792 | 0.712 ± 0.068 | 0.726 ± 0.079 | 0.427 ± 0.140 | 0.825 ± 0.042 |
| UniRep + 1HOT | 0.636 | 0.648 ± 0.103 | 0.650 ± 0.111 | 0.312 ± 0.204 | 0.806 ± 0.061 |
| UniRep + PROP | 0.614 | 0.595 ± 0.104 | 0.589 ± 0.093 | 0.234 ± 0.217 | 0.812 ± 0.040 |
| UniRep + PSSM | 0.634 | 0.615 ± 0.100 | 0.615 ± 0.100 | 0.201 ± 0.166 | 0.738 ± 0.042 |
| UniRep + SeqVec | 0.844 | 0.851 ± 0.055 | 0.847 ± 0.075 | 0.715 ± 0.113 | 0.919 ± 0.032 |

(b) SVM

| Encoding | (inner) | (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.624 | 0.693 ± 0.130 | 0.713 ± 0.139 | 0.396 ± 0.261 | 0.819 ± 0.070 |
| PROP | 0.634 | 0.616 ± 0.108 | 0.606 ± 0.094 | 0.274 ± 0.226 | 0.819 ± 0.041 |
| PSSM | 0.631 | 0.602 ± 0.087 | 0.623 ± 0.102 | 0.217 ± 0.178 | 0.750 ± 0.044 |
| UniRep | 0.775 | 0.768 ± 0.077 | 0.755 ± 0.099 | 0.544 ± 0.162 | 0.869 ± 0.036 |
| SeqVec | 0.778 | 0.777 ± 0.046 | 0.813 ± 0.052 | 0.567 ± 0.090 | 0.856 ± 0.038 |
| SeqVec + 1HOT | 0.680 | 0.757 ± 0.079 | 0.774 ± 0.103 | 0.527 ± 0.166 | 0.856 ± 0.042 |
| SeqVec + PROP | 0.648 | 0.597 ± 0.114 | 0.589 ± 0.099 | 0.218 ± 0.229 | 0.812 ± 0.044 |
| SeqVec + PSSM | 0.634 | 0.614 ± 0.091 | 0.639 ± 0.110 | 0.244 ± 0.188 | 0.756 ± 0.041 |
| SeqVec + UniRep | 0.825 | 0.859 ± 0.031 | 0.863 ± 0.042 | 0.721 ± 0.060 | 0.919 ± 0.015 |

(c) PLS-DA

| Encoding | (inner) | (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.452 | 0.452 ± 0.005 | 0.500 ± 0.001 | 0.001 ± 0.001 | 0.825 ± 0.015 |
| PROP | 0.551 | 0.582 ± 0.086 | 0.575 ± 0.065 | 0.249 ± 0.198 | 0.831 ± 0.032 |
| PSSM | 0.542 | 0.592 ± 0.133 | 0.582 ± 0.092 | 0.277 ± 0.290 | 0.844 ± 0.044 |
| UniRep | 0.743 | 0.782 ± 0.060 | 0.782 ± 0.060 | 0.568 ± 0.117 | 0.875 ± 0.034 |
| SeqVec | 0.759 | 0.707 ± 0.081 | 0.695 ± 0.080 | 0.419 ± 0.160 | 0.844 ± 0.044 |
| UniRep + 1HOT | 0.478 | 0.471 ± 0.051 | 0.502 ± 0.034 | 0.002 ± 0.112 | 0.806 ± 0.023 |
| UniRep + PROP | 0.478 | 0.471 ± 0.051 | 0.502 ± 0.034 | 0.267 ± 0.128 | 0.825 ± 0.032 |
| UniRep + PSSM | 0.564 | 0.616 ± 0.110 | 0.599 ± 0.075 | 0.326 ± 0.233 | 0.850 ± 0.041 |
| UniRep + SeqVec | 0.806 | 0.792 ± 0.078 | 0.773 ± 0.074 | 0.599 ± 0.166 | 0.888 ± 0.042 |

(d) RF

| Encoding | (inner) | (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.569 | 0.401 ± 0.077 | 0.523 ± 0.050 | 0.046 ± 0.089 | 0.450 ± 0.124 |
| PROP | 0.631 | 0.572 ± 0.016 | 0.564 ± 0.012 | 0.203 ± 0.090 | 0.812 ± 0.020 |
| PSSM | 0.618 | 0.585 ± 0.110 | 0.567 ± 0.088 | 0.261 ± 0.261 | 0.819 ± 0.064 |
| UniRep | 0.732 | 0.741 ± 0.051 | 0.779 ± 0.079 | 0.503 ± 0.104 | 0.838 ± 0.023 |
| SeqVec | 0.695 | 0.691 ± 0.035 | 0.720 ± 0.053 | 0.407 ± 0.790 | 0.800 ± 0.042 |
| UniRep + 1HOT | 0.728 | 0.703 ± 0.063 | 0.765 ± 0.089 | 0.443 ± 0.139 | 0.794 ± 0.032 |
| UniRep + PROP | 0.710 | 0.692 ± 0.093 | 0.731 ± 0.113 | 0.403 ± 0.192 | 0.806 ± 0.041 |
| UniRep + PSSM | 0.699 | 0.743 ± 0.100 | 0.776 ± 0.128 | 0.501 ± 0.209 | 0.844 ± 0.052 |
| UniRep + SeqVec | 0.778 | 0.764 ± 0.135 | 0.790 ± 0.141 | 0.540 ± 0.267 | 0.850 ± 0.087 |
| UniRep + SeqVec + 1HOT | 0.774 | 0.721 ± 0.108 | 0.738 ± 0.121 | 0.456 ± 0.214 | 0.844 ± 0.044 |
| UniRep + SeqVec + PROP | 0.720 | 0.787 ± 0.134 | 0.793 ± 0.144 | 0.581 ± 0.261 | 0.888 ± 0.061 |
| UniRep + SeqVec + PSSM | 0.741 | 0.733 ± 0.123 | 0.754 ± 0.136 | 0.480 ± 0.242 | 0.850 ± 0.054 |
| Membrane | | Transmembrane |
|---|---|---|
| O82399 | Q8H191 | O64883 |
| P36168 | Q8K459 | P20138 |
| P90551 | Q8VZF1 | Q84P23 |
| Q12524 | Q9LYT1 | Q84P17 |
| Q4WR83 | Q9NKW1 | Q84P21 |
| Q75LJ4 | Q9S9W2 | Q9M0X9 |
| Q9SKX5 | P08659 | |
| Data A | MCC(O) | MCC(I) | MCC(T) | MCC(M) |
|---|---|---|---|---|
| DeepMito | 0.460 | 0.470 | 0.530 | 0.650 |
| DP-SM | 0.850 | 0.490 | 0.990 | 0.560 |
| In-Mito (LR) | 0.680 | 0.730 | 0.690 | 0.820 |
| In-Mito (SVM) | 0.640 | 0.690 | 0.620 | 0.800 |
| Data B | | | | |
| SubMitoPred | 0.420 | 0.340 | 0.190 | 0.510 |
| DeepMito | 0.450 | 0.680 | 0.540 | 0.790 |
| DP-SM | 0.920 | 0.690 | 0.970 | 0.730 |
| In-Mito (LR) | 0.690 | 0.750 | 0.620 | 0.850 |
| In-Mito (SVM) | 0.650 | 0.760 | 0.540 | 0.840 |
Hyperparameters for SVM, RF, PLS-DA and LR (values not recovered from the source layout).
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Anteghini, M.; Martins dos Santos, V.; Saccenti, E. In-Pero: Exploiting Deep Learning Embeddings of Protein Sequences to Predict the Localisation of Peroxisomal Proteins. Int. J. Mol. Sci. 2021, 22, 6409. https://doi.org/10.3390/ijms22126409