A Novel Multi-View Ensemble Learning Architecture to Improve the Structured Text Classification
Abstract
1. Introduction
2. Theoretical Background
2.1. Ensemble Classifiers
2.2. Multi-View Ensemble Learning
3. Related Works
4. Material and Methods
- (1) Figure 1a: The text pre-processing techniques applied to create each section view. Different pre-processing methods may be applied to different sections.
- (2) Figure 1b: The classification algorithms used as base learners. Different classifiers may be used for different sections.
- (3) Figure 1c: The classification algorithm used as the meta-classifier.
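The three components above (one view per section, one base learner per view, and a meta-classifier that combines their outputs) can be illustrated with a minimal stacking sketch. Everything in it is an assumption made for illustration: the toy documents, the `SECTIONS` list, and the helper names are hypothetical, and a small naive Bayes model and a logistic regression stand in for the base learners and the meta-classifier rather than reproducing the classifiers used in the experiments.

```python
import math
from collections import Counter


def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow on long texts.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


class NaiveBayesView:
    """Stand-in base learner: multinomial naive Bayes on one section view."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = Counter(labels)
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def predict_proba(self, text):
        """Estimated probability that the section text is relevant (1)."""
        log_odds = math.log((self.priors[1] + 1) / (self.priors[0] + 1))
        totals = {y: sum(self.counts[y].values()) + len(self.vocab) for y in (0, 1)}
        for tok in text.lower().split():
            if tok in self.vocab:  # ignore tokens unseen in training
                log_odds += math.log((self.counts[1][tok] + 1) / totals[1])
                log_odds -= math.log((self.counts[0][tok] + 1) / totals[0])
        return sigmoid(log_odds)


class LogisticMeta:
    """Stand-in meta-classifier: logistic regression trained by SGD."""

    def fit(self, X, y, epochs=500, lr=0.5):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                err = self.predict_proba(xi) - yi
                self.w = [w - lr * err * v for w, v in zip(self.w, xi)]
                self.b -= lr * err
        return self

    def predict_proba(self, x):
        return sigmoid(self.b + sum(w * v for w, v in zip(self.w, x)))

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)


def train_stacking(docs, labels, sections, n_folds=2):
    """Train one base learner per section and a meta-classifier on top."""
    # Level 0: out-of-fold predictions, so the meta-classifier never sees
    # a base-learner output for a document that learner was trained on.
    meta_X = [[0.0] * len(sections) for _ in docs]
    for fold in range(n_folds):
        train = [i for i in range(len(docs)) if i % n_folds != fold]
        held_out = [i for i in range(len(docs)) if i % n_folds == fold]
        for j, sec in enumerate(sections):
            model = NaiveBayesView().fit(
                [docs[i][sec] for i in train], [labels[i] for i in train])
            for i in held_out:
                meta_X[i][j] = model.predict_proba(docs[i][sec])
    # Level 1: the meta-classifier learns how to combine the views.
    meta = LogisticMeta().fit(meta_X, labels)
    # Refit each base learner on all documents for use at prediction time.
    base = {s: NaiveBayesView().fit([d[s] for d in docs], labels) for s in sections}
    return base, meta


def predict_doc(base, meta, doc, sections):
    return meta.predict([base[s].predict_proba(doc[s]) for s in sections])


# Hypothetical toy corpus: two section views per document, 1 = relevant.
SECTIONS = ["title", "abstract"]
docs = [
    {"title": "insulin diabetes", "abstract": "glucose insulin patients"},
    {"title": "diabetes therapy", "abstract": "insulin glucose control"},
    {"title": "bridge design", "abstract": "steel concrete load"},
    {"title": "steel bridge", "abstract": "concrete load design"},
    {"title": "insulin therapy", "abstract": "diabetes patients glucose"},
    {"title": "diabetes insulin", "abstract": "glucose control patients"},
    {"title": "concrete bridge", "abstract": "steel load bridge"},
    {"title": "bridge steel", "abstract": "design concrete steel"},
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
base, meta = train_stacking(docs, labels, SECTIONS)
```

Calling `predict_doc(base, meta, new_doc, SECTIONS)` on an unseen document classifies it from the combined section views. Because each section has its own model, a different pre-processing pipeline or classifier could be swapped in per view without touching the meta level.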
5. Experiments
5.1. Dataset Construction
5.2. View Generation Phase: Text Pre-Processing
- Named Entity Recognition (NER): the task of identifying terms that mention a known entity in the text. Entities typically fall into a pre-defined set of categories such as person, location or organization. For our work, we are interested in identifying Life Sciences entities such as proteins and genes. For this reason, we used the biomedical Named Entity Recognition tool ABNER [19];
- Special characters removal: punctuation, digits and some special characters (such as “;”, “:”, “!”, “?”, “0”, “[” or “]”) are removed;
- Tokenization: splits the document sections into tokens, e.g., terms or attributes;
- Stopwords removal: removes words that carry little meaning, such as articles, conjunctions and prepositions (e.g., “a”, “the”, “at”). We used a list of 659 stopwords to be identified and removed from the documents;
- Dictionary Validation: a term is considered valid if it appears in a dictionary. We gathered several dictionaries of common English terms, such as ispell (https://www.cs.hmc.edu/~geoff/ispell-dictionaries.html, accessed on 29 March 2022) and WordNet (http://wordnet.princeton.edu/, accessed on 29 March 2022) [20]. For biological and medical terms, we used BioLexicon [21], the Hosford Medical Terms Dictionary and Gene Ontology (GO) (http://www.geneontology.org/, accessed on 29 March 2022);
- Synonyms handling: uses WordNet (an English lexical database) for regular English (“non-technical”) words and Gene Ontology for technical terms;
- Stemming: the process of removing the inflectional affixes of words, thus reducing the words to their root. We used the Porter Stemmer algorithm [22] to normalize several term variants into the same form and to reduce the number of terms;
- Bag of Words (BoW): the traditional representation of a document corpus. A document–term matrix is used, where each row represents a document from the corpus and each column represents a word of the corpus vocabulary. Each weight is the normalized frequency of the word, given by the Term Frequency-Inverse Document Frequency (TF-IDF) [23].
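Some of the steps above can be sketched in a few lines. The snippet below is a simplified illustration, not the pipeline used in the experiments: it covers only special character removal, tokenization, stopword removal and a TF-IDF weighted document–term matrix. NER, dictionary validation, synonym handling and stemming are left out because they depend on external resources (ABNER, the dictionaries, the Porter Stemmer). The stopword list, the sample texts and the function names are hypothetical, and the TF-IDF formula is one common variant (length-normalized term frequency times the log inverse document frequency).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; the experiments use a list of 659 stopwords.
STOPWORDS = {"a", "an", "the", "at", "of", "in", "and", "or", "is", "to"}


def preprocess(text):
    """Remove special characters, tokenize, and drop stopwords."""
    # Replace punctuation, digits and any other non-letter with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_matrix(docs):
    """Build a document-term matrix with TF-IDF weights.

    Returns (vocabulary, rows), where rows[i][j] is the weight of
    vocabulary[j] in document i.
    """
    tokenized = [preprocess(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append([
            (counts[t] / total) * math.log(n_docs / df[t])
            for t in vocabulary
        ])
    return vocabulary, rows


# Hypothetical section texts standing in for one view (e.g., the Abstract).
abstracts = [
    "Insulin resistance in type 2 diabetes: a review.",
    "Hypertension and the kidney [1].",
    "Insulin therapy and hypertension at the clinic!",
]
vocab, matrix = tfidf_matrix(abstracts)
```

Running the same functions separately on each section (Title, Abstract, Introduction, and so on) would yield one matrix per section view, each with its own vocabulary.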
5.3. Ensemble Training Phase: Base Learners and Meta-Learner
5.4. Kappa Statistics and Statistical Significance
6. Results and Discussion
6.1. Comparing Title–Abstract Classification vs. Full Text Classification
6.2. Comparing Full Text Classification with Stacking vs. Single Classifier
7. Discussion
8. Conclusions and Future Work
- We propose a novel, efficient multi-view ensemble classification scheme based on stacking. Our experimental comparison with traditional single-classifier classification indicates that the proposed scheme performs better for text classification;
- This work benefits research on mining biomedical full-text documents. To the best of our knowledge, our study is the first to use a multi-view ensemble learning scheme for full-text scientific document classification;
- Although the proposed classification scheme was developed based on an empirical analysis of biomedical documents, it can be applied to several other structured text corpora, including web pages, blogs, tweets, scientific text repositories or full text databases.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249.
- Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
- Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; Chapman and Hall/CRC: Boca Raton, FL, USA, 2019.
- Pfahringer, B.; Bensusan, H.; Giraud-Carrier, C.G. Meta-Learning by Landmarking Various Learning Algorithms. In Proceedings of the ICML, Stanford, CA, USA, 29 June–2 July 2000; pp. 743–750.
- Gaye, B.; Zhang, D.; Wulamu, A. A Tweet Sentiment Classification Approach Using a Hybrid Stacked Ensemble Technique. Information 2021, 12, 374.
- Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Xu, C.; Tao, D.; Xu, C. A Survey on Multi-view Learning. arXiv 2013, arXiv:1304.5634.
- Bickel, S.; Scheffer, T. Multi-View Clustering. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM ’04), Brighton, UK, 1–4 November 2004; pp. 19–26.
- Kumar, V.; Minz, S. Multi-view ensemble learning: An optimal feature set partitioning for high-dimensional data classification. Knowl. Inf. Syst. 2016, 49, 1–59.
- Bai, J.; Wang, J. Improving malware detection using multi-view ensemble learning. Secur. Commun. Netw. 2016, 9, 4227–4241.
- Cuzzocrea, A.; Folino, F.; Guarascio, M.; Pontieri, L. A multi-view multi-dimensional ensemble learning approach to mining business process deviances. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 3809–3816.
- Liu, Y.; Jiang, C.; Zhao, H. Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums. Decis. Support Syst. 2018, 105, 1–12.
- Fraj, M.; Hajkacem, M.A.B.; Essoussi, N. On the use of ensemble method for multi view textual data. J. Inf. Telecommun. 2020, 4, 461–481.
- Ye, X.; Dai, H.; Dong, L.; Wang, X. Multi-view ensemble learning method for microblog sentiment classification. Expert Syst. Appl. 2021, 166, 113987.
- Hersh, W.; Buckley, C.; Leone, T.J.; Hickam, D. OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’94), Dublin, Ireland, 3–6 July 1994; Croft, B.W., van Rijsbergen, C.J., Eds.; Springer: London, UK, 1994; pp. 192–201.
- Gonçalves, C.; Iglesias, E.L.; Borrajo, L.; Camacho, R.; Vieira, A.S.; Gonçalves, C.T. Learnsec: A framework for full text analysis. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Oviedo, Spain, 20–22 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 502–513.
- Gonçalves, C.A.O.; Camacho, R.; Gonçalves, C.T.; Seara Vieira, A.; Borrajo Diz, L.; Lorenzo Iglesias, E. Classification of Full Text Biomedical Documents: Sections Importance Assessment. Appl. Sci. 2021, 11, 2674.
- Gonçalves, C.A.; Gonçalves, C.T.; Camacho, R.; Oliveira, E.C. The Impact of Pre-processing on the Classification of MEDLINE Documents. In Pattern Recognition in Information Systems, Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems, Funchal, Portugal, 8–12 June 2010; SciTePress: Setúbal, Portugal, 2010; pp. 53–61.
- Settles, B. ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21, 3191–3192.
- Fellbaum, C. (Ed.) WordNet: An Electronic Lexical Database; MIT Press: Cambridge, MA, USA, 1998.
- Rebholz-Schuhmann, D.; Pezik, P.; Lee, V.; Kim, J.J.; del Gratta, R.; Sasaki, Y.; McNaught, J.; Montemagni, S.; Monachini, M.; Calzolari, N.; et al. BioLexicon: Towards a Reference Terminological Resource in the Biomedical Domain. In Proceedings of the 16th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB-2008), Toronto, ON, Canada, 19–23 July 2008.
- Porter, M.F. Readings in Information Retrieval; Chapter: An Algorithm for Suffix Stripping; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; pp. 313–316.
- Zhou, W.; Smalheiser, N.R.; Clement, Y. A Tutorial on Information Retrieval: Basic Terms and Concepts. J. Biomed. Discov. Collab. 2006, 1, 1–8.
- Zeng, Z.Q.; Yu, H.B.; Xu, H.R.; Xie, Y.Q.; Gao, J. Fast training support vector machines using parallel sequential minimal optimization. In Proceedings of the IEEE 2008 3rd International Conference on Intelligent System and Knowledge Engineering, Xiamen, China, 17–19 November 2008; Volume 1, pp. 997–1001.
- Ženko, B.; Todorovski, L.; Džeroski, S. A Comparison of Stacking with Meta Decision Trees to Bagging, Boosting, and Stacking with Other Methods. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 669–670.
- Zian, S.; Kareem, S.A.; Varathan, K.D. An Empirical Evaluation of Stacked Ensembles With Different Meta-Learners in Imbalanced Classification. IEEE Access 2021, 9, 87434–87452.
- Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014.
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Carletta, J. Assessing agreement on classification tasks: The kappa statistic. arXiv 1996, arXiv:cmp-lg/9602004.
- Viera, A.J.; Garrett, J.M. Understanding interobserver agreement: The kappa statistic. Fam. Med. 2005, 37, 360–363.
- Nadeau, C.; Bengio, Y. Inference for the Generalization Error. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1999; Volume 12, Available online: https://proceedings.neurips.cc/paper/1999/hash/7d12b66d3df6af8d429c1a357d8b9e1a-Abstract.html (accessed on 1 April 2022).
Dataset | Description | Relevant | Non-Relevant |
---|---|---|---|
C01 | Bacterial Infections and Mycoses | 417 | 13,625 |
C02 | Virus Diseases | 1178 | 13,080 |
C03 | Parasitic Diseases | 51 | 13,884 |
C04 | Neoplasms | 5537 | 8789 |
C05 | Musculoskeletal | 51 | 13,884 |
C06 | Digestive System | 1662 | 12,484 |
C07 | Stomatognathic | 145 | 13,372 |
C08 | Respiratory Tract | 857 | 13,184 |
C09 | Otorhinolaryngologic | 215 | 13,845 |
C10 | Nervous System | 2780 | 11,394 |
C11 | Eye Diseases | 392 | 13,699 |
C12 | Urologic and Male Genital Diseases | 1196 | 12,985 |
C13 | Female Genital Diseases and Pregnancy Complications | 1136 | 12,954 |
C14 | Cardiovascular Diseases | 2532 | 11,792 |
C15 | Hemic and Lymphatic | 450 | 13,756 |
C16 | Neonatal Diseases and Abnormalities | 469 | 13,753 |
C17 | Skin and Connective Tissue | 1227 | 13,072 |
C18 | Nutritional and Metabolic | 1043 | 13,267 |
C19 | Endocrine Diseases | 772 | 13,415 |
C20 | Immunologic Diseases | 1721 | 12,536 |
C22 | Animal Diseases | 76 | 13,964 |
C23 | Pathological Conditions, Signs and Symptoms | 7191 | 7136 |
C25 | Chemically-Induced Disorders | 174 | 13,995 |
C26 | Wounds and Injuries | 247 | 13,949 |
Terms | Title | Abstract | Introduction | Methods | Results | Conclusions |
---|---|---|---|---|---|---|
angiotonin | 0/31 | 1/104 | 0/506 | 2/231 | 20/972 | 0/122 |
collagen | 0/14 | 0/327 | 0/1878 | 2/2040 | 0/9248 | 0/594 |
diabet | 0/214 | 4/1493 | 5/5573 | 0/6299 | 42/22,692 | 0/2463 |
hypertens | 1/136 | 7/914 | 6/3319 | 6/3664 | 164/14,274 | 0/1203 |
insulin | 0/107 | 0/925 | 0/4511 | 4/3955 | 6/16,962 | 0/1001 |
kidnei | 1/96 | 2/669 | 4/2654 | 28/3152 | 58/11,594 | 0/636 |
methanol | 0/0 | 0/2 | 0/38 | 4/1696 | 0/310 | 0/132 |
pathogenesi | 0/28 | 0/653 | 0/2509 | 0/213 | 2/4402 | 0/377 |
Section | #Terms |
---|---|
Title | 4798 |
Abstract | 8822 |
Introduction | 14,130 |
Methods | 15,948 |
Results | 19,255 |
Conclusions | 11,257 |
Kappa | Agreement |
---|---|
<0 | Less than chance |
0.01–0.20 | Slight |
0.21–0.40 | Fair |
0.41–0.60 | Moderate |
0.61–0.80 | Substantial |
0.81–0.99 | Almost perfect |
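As a worked illustration of how the scale above is applied, the following sketch computes Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), from two label sequences and maps the value to an agreement level. The function names and the example labels are hypothetical. The table leaves small gaps between bands (for example, exactly 0, or values between 0.20 and 0.21), so the mapping below assumes each band extends up to its upper bound and that values above 0.80 count as almost perfect.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(y_true)
    categories = set(y_true) | set(y_pred)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in categories)
    if p_e == 1.0:
        return 1.0  # degenerate case: both sequences use a single label
    return (p_o - p_e) / (1.0 - p_e)


def agreement_level(kappa):
    """Map a kappa value to the agreement scale (band edges are assumed)."""
    if kappa < 0:
        return "Less than chance"
    for upper, level in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return level
    return "Almost perfect"


# Hypothetical labels: 1 = relevant, 0 = non-relevant.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(actual, predicted)
level = agreement_level(kappa)
```

Here the observed agreement is 0.80 and the chance agreement is 0.3 × 0.3 + 0.7 × 0.7 = 0.58, so kappa is 0.22 / 0.42 ≈ 0.52, which falls in the moderate band even though 80% of the labels match.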
Corpus | Multi-View Full Text | Single SVM Full Text |
---|---|---|
C01 | 0.30 | 0.43 |
C04 | 0.79 | 0.82 |
C06 | 0.38 | 0.54 |
C14 | 0.52 | 0.63 |
C20 | 0.51 | 0.62 |
Corpus | Title | Abstract | Introduction | Methods | Results | Conclusions |
---|---|---|---|---|---|---|
C01 | 0.13 | 0.15 | 0.17 | 0.30 | 0.38 | 0.05 |
C02 | 0.17 | 0.26 | 0.39 | 0.55 | 0.64 | 0.09 |
C03 | 0.26 | 0.32 | 0.22 | 0.33 | 0.34 | 0.02 |
C04 | 0.16 | 0.61 | 0.56 | 0.59 | 0.65 | 0.10 |
C05 | 0.06 | 0.07 | 0.11 | 0.11 | 0.18 | 0.03 |
C06 | 0.10 | 0.17 | 0.15 | 0.37 | 0.49 | 0.05 |
C07 | 0.16 | 0.16 | 0.13 | 0.19 | 0.23 | 0.05 |
C08 | 0.12 | 0.17 | 0.14 | 0.34 | 0.45 | 0.06 |
C09 | 0.20 | 0.18 | 0.25 | 0.30 | 0.34 | 0.15 |
C10 | 0.15 | 0.24 | 0.26 | 0.45 | 0.50 | 0.11 |
C11 | 0.21 | 0.37 | 0.23 | 0.52 | 0.55 | 0.08 |
C12 | 0.10 | 0.25 | 0.23 | 0.38 | 0.54 | 0.06 |
C13 | 0.12 | 0.21 | 0.22 | 0.34 | 0.44 | 0.08 |
C14 | 0.15 | 0.31 | 0.25 | 0.45 | 0.51 | 0.15 |
C15 | 0.17 | 0.11 | 0.13 | 0.12 | 0.24 | 0.04 |
C16 | 0.09 | 0.09 | 0.16 | 0.09 | 0.17 | 0.03 |
C17 | 0.10 | 0.22 | 0.24 | 0.39 | 0.52 | 0.04 |
C18 | 0.07 | 0.12 | 0.15 | 0.28 | 0.39 | 0.09 |
C19 | 0.09 | 0.15 | 0.16 | 0.28 | 0.43 | 0.03 |
C20 | 0.16 | 0.21 | 0.28 | 0.45 | 0.52 | 0.07 |
C22 | 0.10 | 0.13 | 0.10 | 0.11 | 0.20 | 0.00 |
C23 | 0.22 | 0.35 | 0.29 | 0.32 | 0.36 | 0.07 |
C25 | 0.13 | 0.06 | 0.12 | 0.21 | 0.32 | 0.03 |
C26 | 0.07 | 0.13 | 0.07 | 0.18 | 0.23 | 0.03 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gonçalves, C.A.; Vieira, A.S.; Gonçalves, C.T.; Camacho, R.; Iglesias, E.L.; Diz, L.B. A Novel Multi-View Ensemble Learning Architecture to Improve the Structured Text Classification. Information 2022, 13, 283. https://doi.org/10.3390/info13060283