News Classification for Identifying Traffic Incident Points in a Spanish-Speaking Country: A Real-World Case Study of Class Imbalance Learning
Abstract
1. Introduction
Managing urban areas has become one of the most important development challenges of the 21st century. Our success or failure in building sustainable cities will be a major factor in the success of the post-2015 UN development agenda [2].
- Improve operations that impact its quality of life (e.g., economic vitality, education, employment, environmental footprint, health care, power supply, safety, and transportation).
- Enable a shared understanding of what is happening in the ‘city’.
- Engage both citizens and (private and public) organizations.
- get a corpus of news published by the local newspaper through an RSS (Really Simple Syndication) feed,
- get a vector characterization based on the well-known Bag-of-Words (BoW) representation [6],
- select features through a mutual information-based method,
- train a supervised-learning model,
- classify online news reports in real-time; the interest is in the ‘traffic accident’ class,
- process the text of the RSS reports to retrieve the location where the accidents happened, and
- notify users about the events on a map.
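To make the workflow above concrete, the following sketch (not the authors' exact implementation) wires together the BoW representation, mutual-information feature selection, and a linear classifier using scikit-learn, which the paper cites; the report texts, labels, and variable names are invented placeholders.

```python
# Minimal sketch of the classification workflow: BoW vectorization,
# mutual-information feature selection, and a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 1 = 'traffic accident', 0 = other news
reports = ["Choque en la avenida Tecnológico deja dos heridos",
           "El municipio anuncia nueva inversión en alumbrado público"]
labels = [1, 0]

model = make_pipeline(
    CountVectorizer(),                          # Bag-of-Words representation
    SelectKBest(mutual_info_classif, k="all"),  # mutual-information feature selection
                                                # (k would be tuned on a real corpus)
    SGDClassifier(loss="hinge"),                # linear SVM trained with SGD
)
model.fit(reports, labels)
print(model.predict(["Volcadura de camión en el periférico de la Juventud"]))
```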
2. Background
- Decision tree-based approaches, which assign labels to input values by devising decision rules; two of the most popular are
  - CART (Classification and Regression Tree), which partitions continuous attribute values into a discrete set of intervals. Unlike other decision-tree classifiers, CART does not compute rule sets but supports numerical target variables (regression). It constructs binary trees using, at each node, the feature and threshold that yield the largest information gain [16]; and,
  - Random Forest, which improves the classification capacity and controls over-fitting by assembling several decision trees built on different sub-samples of the dataset [17].
- Bayesian classifiers, which estimate how each attribute value affects the likelihood of each class being assigned. The Naïve Bayes method is the simplest Bayesian classifier; it applies Bayes’ theorem with the ‘naive’ assumption of conditional independence between every pair of features. Naïve Bayes classifiers are known to work markedly well in real-world situations, even when the underlying independence assumption is violated [18].
- Proximity-based classifiers, which use distance measures to classify under the premise that texts belonging to the same category are ‘closer’ to each other than to those in other classes. The kNN (k-nearest neighbors) method is perhaps the archetype of this kind of lazy learner. kNN predicts the label of a pattern by searching for the k training samples that are closest to the new entry.
- Linear classifiers, which classify based on the value of linear combinations of the document features, trying to find ‘good’ linear separators among classes. The Support Vector Machine (SVM) falls into this category. For a given training dataset, this eager classifier searches for an optimum hyperplane to classify new samples [14]. SVMs are popular text classifiers in the scientific literature (e.g., [19,20,21,22,23,24]) because they are less sensitive than other classifiers to some common issues of text mining, namely high-dimensional feature spaces, sparse vectors, and irrelevant features (cf. [19,25]).
- (a) high dimensionality of the vector representation,
- (b) loss of correlation with adjacent words, and
- (c) loss of semantic relationships among terms.
- (a) those observations usually overlap with the majority region,
- (b) data analysis techniques may confuse minority examples with noise or outliers (and vice versa),
- (c) good coverage of the majority examples distorts the minority examples, and
- (d) a small sample with a lack of density but with high feature dimensionality makes it difficult to identify a pattern.
- Oversampling approaches: they add minority class examples, either by replicating existing ones or by synthesizing new ones. Some popular methods are
  - Random oversampling (ROS), which replicates randomly selected minority class examples to make this region of the space denser.
  - Synthetic minority over-sampling technique (SMOTE), which generates new synthetic examples along the line segments between minority examples and their selected nearest neighbors [34].
  - Borderline SMOTE, a variant of the original SMOTE algorithm in which only the minority examples near the borderline are oversampled [35].
  - Adaptive synthetic oversampling (ADASYN), which generates more synthetic entries for the minority class examples that are harder to learn; the algorithm weights the minority class examples by their level of difficulty in learning [36].
- Undersampling approaches: they discard samples from the majority class. The simplest yet most effective method is random undersampling (RUS), which eliminates majority class examples at random.
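As a toy illustration of these two families of strategies, the snippet below (assuming the imbalanced-learn library, whose samplers correspond to the methods named above, and a fabricated dataset) shows how oversampling and undersampling change the class distribution.

```python
# Toy illustration: class counts before and after resampling an imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic dataset with roughly a 9:1 class ratio
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))           # imbalanced counts, e.g., {0: ~900, 1: ~100}

X_over, y_over = SMOTE().fit_resample(X, y)                  # synthesize minority examples
print(Counter(y_over))      # balanced by adding minority points

X_under, y_under = RandomUnderSampler().fit_resample(X, y)   # drop majority examples
print(Counter(y_under))     # balanced by removing majority points
```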
- ROC-AUC. The ROC (Receiver Operating Characteristic) curve is formed when the horizontal axis represents the ‘false positive rate’ (1 − specificity) and the vertical axis represents the ‘true positive rate’ (sensitivity) for different cut-off points. Its area under the curve (AUC) summarizes the degree of separability between classes; it can be read as the probability that the classifier ranks a randomly chosen positive example above a randomly chosen negative one. A ROC-AUC value close to 1 indicates that the classifier separates the classes well, whereas a value close to 0.5 indicates that it cannot discriminate between them.
- PR-AUC. Like the ROC curve, the PR (Precision–Recall) curve plots precision (y-axis) against recall (x-axis) for different probability thresholds. The PR curve is a useful diagnostic tool for imbalanced binary models because it emphasizes the performance of the classifier on the minority class; its area under the curve (PR-AUC) summarizes the curve in a single value, where 1.0 represents a model with perfect skill.
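Both ranking metrics can be computed from a classifier's scores. The sketch below (assuming scikit-learn, with invented labels and scores) shows one common way to do so; the paper does not detail its exact computation.

```python
# Computing ROC-AUC and PR-AUC from decision scores.
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# Hypothetical ground truth (1 = traffic accident) and classifier scores
y_true   = [0, 0, 0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9, 0.4, 0.65, 0.05, 0.55]

roc_auc = roc_auc_score(y_true, y_scores)

precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)          # area under the precision-recall curve

print(f"ROC-AUC = {roc_auc:.3f}, PR-AUC = {pr_auc:.3f}")
```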
3. A Brief Review of the Related Literature
4. Our Proposal
4.1. Data Gathering
4.2. Knowledge Discovery
Algorithm 1. Pseudocode of the knowledge-discovery phase.
In: Data: set of news reports in natural language.
4.3. Knowledge Application and Deployment
- (a) sentence segmentation,
- (b) tokenization,
- (c) part-of-speech (POS) tagging,
- (d) named entity recognition (NER), and
- (e) relationship extraction.
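The paper cites spaCy for this kind of processing. The following sketch (a simplification, assuming the Spanish model es_core_news_sm and an invented report) illustrates steps (a) through (d); the grammar-based relationship extraction of step (e) is left to rules such as those sketched in Appendix A.

```python
# Minimal sketch of the NLP steps with spaCy and its small Spanish model.
import spacy

nlp = spacy.load("es_core_news_sm")  # assumes the Spanish model is installed

report = ("Un choque entre dos vehículos se registró esta mañana "
          "en la avenida Tecnológico, a la altura del bulevar Juan Pablo II.")
doc = nlp(report)

# (a) sentence segmentation and (b) tokenization
sentences = [[token.text for token in sent] for sent in doc.sents]

# (c) part-of-speech tagging
pos_tags = [(token.text, token.pos_) for token in doc]

# (d) named entity recognition: location-like entities (LOC in the Spanish scheme)
locations = [ent.text for ent in doc.ents if ent.label_ in ("LOC", "GPE")]
print(locations)
```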
5. Results
5.1. Performance of the Classifiers on the Imbalanced Corpus
- CART: criterion = ‘entropy’, and max_depth = None.
- Complement Naïve Bayes: fit_prior = True, class_prior = None, and norm = False.
- kNN: n_neighbors = 10, weights = ‘distance’, and p = 2 (Euclidean distance).
- Random Forest: n_estimators = 100, criterion = ‘entropy’, max_depth = None.
- SVM: loss = ‘hinge’, penalty = ‘l2’, α = 0.001, max_iter = 5, learning_rate = ’optimal’, ε = 0.1, and tol = None.
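These parameter names follow scikit-learn's API, which the paper cites. The snippet below is a sketch of how the five classifiers could be instantiated with the settings listed above; expressing the linear SVM as an SGDClassifier with hinge loss is an assumption consistent with the loss, α, and learning-rate parameters reported.

```python
# Sketch: instantiating the five classifiers with the reported settings (scikit-learn).
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

classifiers = {
    "CART": DecisionTreeClassifier(criterion="entropy", max_depth=None),
    "Complement Naive Bayes": ComplementNB(fit_prior=True, class_prior=None, norm=False),
    "kNN": KNeighborsClassifier(n_neighbors=10, weights="distance", p=2),
    "Random Forest": RandomForestClassifier(n_estimators=100, criterion="entropy",
                                            max_depth=None),
    # Linear SVM trained by stochastic gradient descent (loss='hinge')
    "SVM": SGDClassifier(loss="hinge", penalty="l2", alpha=0.001, max_iter=5,
                         learning_rate="optimal", epsilon=0.1, tol=None),
}
```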
- The results from SVM were quite encouraging because this classifier ranked best in terms of PR-AUC (the most appropriate measure under conditions of class imbalance) and second-best in terms of precision, F-measure, and ROC-AUC.
- Random Forest is also remarkable, because it was the best considering ROC-AUC and F1-score, and the second-best in sensitivity and PR-AUC.
- Although CART obtained the highest recall, it was simultaneously the worst in terms of precision, ROC-AUC, and PR-AUC.
- Complement Naïve Bayes was only notable in precision, but it was not skillful in recovering instances of the minority class.
- According to Table 2, kNN (with that setting) seems to be the least suitable classifier for this dataset, with the worst values in F-measure and recall.
5.2. Impact of the Sampling Methods on SVM and Random Forest
- SMOTE: sampling_strategy = ‘minority’, k_neighbors = 5, and ratio = None.
- Borderline SMOTE: sampling_strategy = ‘minority’, k_neighbors = 5, m_neighbors = 10, and kind = ‘borderline-1’.
- ADASYN: sampling_strategy = ‘minority’, n_neighbors = 5, and ratio = None.
- ROS: sampling_strategy = ‘minority’, and ratio = None.
- RUS: sampling_strategy = ‘majority’, replacement = False, and ratio = None.
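These parameter names match the samplers of the imbalanced-learn library; the sketch below (an assumption about the tooling, not the authors' published code) pairs each sampler with the SVM in a pipeline so that resampling is applied only to the training partition of each fold.

```python
# Sketch: combining each sampler (with the reported settings) with the SVM.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

samplers = {
    "SMOTE": SMOTE(sampling_strategy="minority", k_neighbors=5),
    "Borderline SMOTE": BorderlineSMOTE(sampling_strategy="minority", k_neighbors=5,
                                        m_neighbors=10, kind="borderline-1"),
    "ADASYN": ADASYN(sampling_strategy="minority", n_neighbors=5),
    "ROS": RandomOverSampler(sampling_strategy="minority"),
    "RUS": RandomUnderSampler(sampling_strategy="majority", replacement=False),
}

svm = SGDClassifier(loss="hinge", penalty="l2", alpha=0.001, max_iter=5,
                    learning_rate="optimal", tol=None)

# One pipeline per sampler; cloning keeps each classifier independent.
pipelines = {name: make_pipeline(sampler, clone(svm)) for name, sampler in samplers.items()}
# Each pipeline can then be evaluated with, e.g., sklearn.model_selection.cross_validate.
```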
- RUS was the best sampling approach in the scenario where the presence of false negatives is a critical concern (recall).
- We do not recommend resampling if false positives entail dire consequences (precision).
- In terms of F1-score, SMOTE and Borderline SMOTE performed best.
- ROS seems to be the leading choice when considering ROC-AUC alone.
- Finally, an apparent inconsistency in the impact of sampling methods occurs when PR-AUC is considered decisive. On the one hand, SVM achieved the best results when ROS was applied (followed by RUS). On the other hand, contrary to expectations, no sampling algorithm improved the performance of Random Forest.
- (a) Recall: from 0.5658 to 0.8558.
- (b) F-measure: from 0.7069 to 0.7828.
- (c) ROC-AUC: from 0.9597 to 0.9693.
- (d) PR-AUC: from 0.8545 to 0.8561.
5.3. Performance of the Location Extraction Module
- The report mentions multiple road mishaps and locations: the data extractor found several fragments of text matching the grammatical rules, but, once geolocated, these locations are neither connected to nor close to a common point. Such reports are often recapitulations of past news that included traffic accidents.
- The event occurred outside the city: although the extractor found a fragment of text matching the grammatical patterns, the geolocated point lies outside the town.
- No location is specified: the data extractor did not find any match for the grammatical rules.
- (a) Exact match. The location was detected in a single fragment of text matching the rules, exactly as described in the news report.
- (b) Partial matches. The location was detected across several matching fragments of text.
6. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A
- (a) A prefix indicating the kind of road (either in singular or plural), followed by an optional article and preposition (e.g., ‘carretera de’, ‘avenida de la’, ‘bulevar’, ‘periférico’).
- (b) The name of the road, consisting of an optional number followed by up to four words, with an optional grammatical article and preposition between each of them (e.g., ‘calle Norte’, ‘periférico de la Juventud’, ‘bulevar Juan Pablo II’, ‘calle 20 de noviembre’).
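As an illustration only, the regular expression below approximates the spirit of these two rules in Python; it is not the authors' grammar and is deliberately permissive about capitalization, so it may over-match in free text.

```python
# Hedged sketch of a road-name pattern in the spirit of Appendix A.
import re

ROAD_PATTERN = re.compile(
    r"\b(?:calles?|avenidas?|bulevar(?:es)?|carreteras?|perif[eé]ricos?)"  # (a) road-type prefix
    r"(?:\s+(?:de|del|el|la|los|las)){0,2}"                                # optional article/preposition
    r"(?:\s+\d+)?"                                                         # optional road number
    r"(?:\s+(?:de|del|el|la|los|las|\w+)){1,4}",                           # (b) up to four name words
    flags=re.IGNORECASE | re.UNICODE,
)

text = "Fuerte choque sobre el periférico de la Juventud, cerca de la calle 20 de noviembre."
print(ROAD_PATTERN.findall(text))
# ['periférico de la Juventud', 'calle 20 de noviembre']
```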
Appendix B
References
- United Nations. World Urbanization Prospects 2018. Available online: https://population.un.org/wup/ (accessed on 1 September 2020).
- United Nations. World’s Population Increasingly Urban with More than Half Living in Urban Areas. Available online: http://un.org/en/development/desa/news/population/world-urbanization-prospects-2014.html (accessed on 1 September 2020).
- Ochoa Ortiz-Zezzatti, A.; Rivera, G.; Gómez-Santillán, C.; Sánchez-Lara, B. Handbook of Research on Metaheuristics for Order Picking Optimization in Warehouses to Smart Cities; IGI Global: Hershey, PA, USA, 2019. [Google Scholar] [CrossRef]
- Smart Cities Council. Smart Cities A to Z. Glossary, letter “S”. Available online: http://rg.smartcitiescouncil.com/master-glossary/S (accessed on 1 September 2020).
- Williams, P. What, Exactly, is a Smart City? Available online: http://meetingoftheminds.org/exactly-smart-city-16098 (accessed on 1 September 2020).
- Harris, Z.S. Distributional structure. Word 1954, 10, 146–162. [Google Scholar] [CrossRef]
- Kaur, H.; Pannu, H.S.; Malhi, A.K. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. CSUR 2019, 52, 1–36. [Google Scholar] [CrossRef] [Green Version]
- Zhang, C.; Bi, J.; Xu, S.; Ramentol, E.; Fan, G.; Qiao, B.; Fujita, H. Multi-imbalance: An open-source software for multi-class imbalance learning. Knowl. Based Syst. 2019, 174, 137–143. [Google Scholar] [CrossRef]
- Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef] [Green Version]
- Fernández, A.; García, S.; Herrera, F. Addressing the classification with imbalanced data: Open problems and new challenges on class distribution. In International Conference on Hybrid Artificial Intelligence Systems; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–10. [Google Scholar]
- Lane, J. The 10 Most Spoken Languages in The World. Available online: http://babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world (accessed on 1 September 2020).
- Internet World Stats. Internet World Users by Language: Top 10 Languages. Usage and Population Statistics. Available online: https://www.internetworldstats.com/stats7.htm (accessed on 1 September 2020).
- Aliwy, A.H.; Ameer, E.A. Comparative study of five text classification algorithms with their improvements. Int. J. Appl. Eng. Res. 2017, 12, 4309–4319. [Google Scholar]
- Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutierrez, J.B.; Kochut, K. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. arXiv, 2017; arXiv:1707.02919. [Google Scholar]
- Thangaraj, M.; Sivakami, M. Text Classification Techniques: A Literature Review. Interdiscip. J. Inf. Knowl. Manag. 2018, 13, 117–135. [Google Scholar]
- Steinberg, D.; Colla, P. CART: Classification and Regression Trees. Top Ten Algorithms Data Min. 2009, 9, 179–201. [Google Scholar]
- Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning; Springer: Boston, MA, USA, 2012; pp. 157–175. [Google Scholar]
- Berrar, D. Bayes’ theorem and naïve Bayes classifier. In Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics; Elsevier Science Publisher: Amsterdam, The Netherlands, 2018; pp. 403–412. [Google Scholar]
- Catal, C.; Nangit, M. A sentiment classification model based on multiple classifiers. Appl. Soft Comput. 2017, 50, 135–141. [Google Scholar] [CrossRef]
- Ghaddar, B.; Naoum-Sawaya, J. High dimensional data classification and feature selection using support vector machines. Eur. J. Oper. Res. 2018, 265, 993–1004. [Google Scholar] [CrossRef]
- Goudjil, M.; Koudil, M.; Bedda, M.; Ghoggali, N. A novel active learning method using SVM for text classification. Int. J. Autom. Comput. 2018, 15, 290–298. [Google Scholar] [CrossRef]
- Hu, R.; Namee, B.M.; Delany, S.J. Active learning for text classification with reusability. Expert Syst. Appl. 2016, 45, 438–449. [Google Scholar] [CrossRef]
- Lilleberg, J.; Zhu, Y.; Zhang, Y. Support Vector Machines and word2vec for Text Classification with Semantic Features. In Proceedings of the 14th IEEE International Conference on Cognitive Informatics and Cognitive Computing, Beijing, China, 6–8 July 2015; pp. 136–140. [Google Scholar]
- Onan, A.; Korukoğlu, S.; Bulut, H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 2016, 57, 232–247. [Google Scholar] [CrossRef]
- Xia, H.; Yang, Y.; Pan, X.; Zhang, Z.; An, W. Sentiment analysis for online reviews using conditional random fields and support vector machines. Electron. Commer. Res. 2020, 20, 343–360. [Google Scholar] [CrossRef]
- El-Din, D.M. Enhancement bag-of-words model for solving the challenges of sentiment analysis. Int. J. Adv. Comput. Sci. Appl. 2016, 7. [Google Scholar] [CrossRef] [Green Version]
- Fu, Y.; Feng, Y.; Cunningham, J.P. Paraphrase Generation with Latent Bag of Words. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2019; pp. 13623–13634. [Google Scholar]
- Kim, H.K.; Kim, H.; Cho, S. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing 2017, 266, 336–352. [Google Scholar] [CrossRef] [Green Version]
- Zhao, R.; Mao, K. Fuzzy bag-of-words model for document representation. IEEE Trans. Fuzzy Syst. 2017, 26, 794–804. [Google Scholar] [CrossRef]
- Aggarwal, C.C.; Zhai, C. A Survey of Text Classification Algorithms. In Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012; pp. 163–222. [Google Scholar]
- Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186. [Google Scholar] [CrossRef]
- Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar] [CrossRef]
- García, V.; Sánchez, J.S.; Marqués, A.I.; Florencia, R.; Rivera, G. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 2019, 113026. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
- Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–377. [Google Scholar]
- He, H.; Ma, Y. Imbalanced Learning: Foundations, Algorithms, and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
- Branco, P.; Torgo, L.; Ribeiro, R.P. A survey of predictive modeling on imbalanced domains. CSUR 2016, 49, 1–150. [Google Scholar] [CrossRef]
- Luhn, H.P. A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev. 1957, 1, 309–317. [Google Scholar] [CrossRef]
- Maron, M.E.; Kuhns, J.L. On relevance, probabilistic indexing and information retrieval. JACM 1960, 7, 216–244. [Google Scholar] [CrossRef]
- Parker-Rhodes, A.F. Contributions to the Theory of Clumps I; Cambridge Language Research Unit: Cambridge, UK, 1961; pp. 1–27. [Google Scholar]
- Sebastiani, F. Machine learning in automated text categorization. CSUR 2002, 34, 1–47. [Google Scholar] [CrossRef]
- Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference Machine Learning; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142. [Google Scholar]
- Zhuang, D.; Zhang, B.; Yang, Q.; Yan, J.; Chen, Z.; Chen, Y. Efficient text classification by weighted proximal SVM. In Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, USA, 27–30 November 2005; p. 8. [Google Scholar]
- Liu, Z.; Lv, X.; Liu, K.; Shi, S. Study on SVM compared with the other classification methods. In Proceedings of the 2010 Second International Workshop Education Technology and Computer Science, Wuhan, China, 6–7 March 2010; IEEE: Piscataway, NJ, USA, 2010; Volume 1, pp. 219–222. [Google Scholar]
- Kumar, M.A.; Gopal, M. An Investigation on Linear SVM and its Variants on Text Categorization. In Proceedings of the 2010 Second International Conference Machine Learning and Computing, Bangalore, India, 12–13 February 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 27–31. [Google Scholar]
- Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef] [Green Version]
- Boyle, J.A.; Greig, W.R.; Franklin, D.A.; Harden, R.M.; Buchanan, W.W.; McGirr, E.M. Construction of a model for computer assisted diagnosis: Application of the problem of non-toxic goitre. QJM 1966, 35, 565–588. [Google Scholar]
- Penny, W.D.; Frost, D.P. Neural network modeling of the level of observation decision in an acute psychiatric ward. Comput. Biomed. Res. 1997, 30, 1–17. [Google Scholar] [CrossRef]
- Xu, S. Naïve Bayes classifiers to text classification. J. Inf. Sci. 2018, 44, 48–59. [Google Scholar] [CrossRef]
- Friedman, J.H. On bias, variance, 0/1–loss, and the curse-of-dimensionality. Data Min. Knowl. Discov. 1997, 1, 55–77. [Google Scholar] [CrossRef]
- McCallum, A.; Nigam, K. A comparison of event models for naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization; AAAI Press: Madison, WI, USA, 27 July 1998; Volume 752, pp. 41–48. [Google Scholar]
- Xu, B.; Guo, X.; Ye, Y.; Cheng, J. An Improved Random Forest Classifier for Text Categorization. JCP 2012, 7, 2913–2920. [Google Scholar] [CrossRef] [Green Version]
- Tan, S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 2005, 28, 667–671. [Google Scholar] [CrossRef] [Green Version]
- Yong, Z.; Youwen, L.; Shixiong, X. An improved KNN text classification algorithm based on clustering. J. Comput. 2009, 4, 230–237. [Google Scholar]
- Sriram, B.; Fuhry, D.; Demir, E.; Ferhatosmanoglu, H.; Demirbas, M. Short text classification in twitter to improve information filtering. In Proceedings of the 33rd ACM SIGIR International Conference of Research and Development on Information Retrieval, Geneva, Switzerland, 19–23 July 2010; pp. 841–842. [Google Scholar]
- Burnap, P.; Williams, M.L. Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making. Policy Internet. 2015, 7, 223–242. [Google Scholar] [CrossRef] [Green Version]
- Dilrukshi, I.; de Zoysa, K.; Caldera, A. Twitter news classification using SVM. In Proceedings of the 8th International Conference on Computer Science & Education, Colombo, Sri Lanka, 26–28 April 2013; pp. 287–291. [Google Scholar]
- Song, G.; Ye, Y.; Du, X.; Huang, X.; Bie, S. Short text classification: A survey. J. Multimed. 2014, 9, 635. [Google Scholar] [CrossRef]
- Hofmann, T. Probabilistic Latent Semantic Analysis. arXiv, 1999; arXiv:1301.6705. [Google Scholar]
- L’Huillier, G.; Hevia, A.; Weber, R.; Rios, S. Latent semantic analysis and keyword extraction for phishing classification. In Proceedings of the 2010 IEEE International Conference on Intelligence and Security Informatics, Vancouver, BC, Canada, 23–26 May 2010; pp. 129–131. [Google Scholar]
- Zeng, Z.; Zhang, S.; Liang, H.L.W.; Zheng, H. A novel approach to musical genre classification using probabilistic latent semantic analysis model. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, Cancun, Mexico, 28 June–3 July 2009; pp. 486–489. [Google Scholar]
- Bosch, A.; Zisserman, A.; Muñoz, X. Scene classification via pLSA. In European Conference Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 517–530. [Google Scholar]
- Díaz, G.; Romero, E. Histopathological Image Classification Using Stain Component Features on a pLSA Model. In Iberoamerican Congress Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2010; pp. 55–62. [Google Scholar]
- Haloi, M. A novel pLSA based Traffic Signs Classification System. arXiv, 2015; arXiv:1503.06643. [Google Scholar]
- Kroha, P.; Baeza-Yates, R. A Case Study: News Classification Based on Term Frequency. In Proceedings of the 16th International Workshop on Database and Expert Systems Applications, Copenhagen, Denmark, 22–26 August 2005; pp. 428–432. [Google Scholar]
- Mouriño-García, M.A.; Pérez-Rodríguez, R.; Anido-Rifón, L.; Vilares-Ferro, M. Wikipedia-based hybrid document representation for textual news classification. Soft Comput. 2018, 22, 6047–6065. [Google Scholar] [CrossRef]
- Sankaranarayanan, J.; Samet, H.; Teitler, B.E.; Lieberman, M.D.; Sperling, J. Twitterstand: News in Tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 4–6 November 2009; pp. 42–51. [Google Scholar]
- Li, C.; Zhan, G.; Li, Z. News text classification based on improved Bi-LSTM-CNN. In Proceedings of the IEEE 9th International Conference on Information Technology in Medicine and Education, Hangzhou, China, 19–21 October 2018; pp. 890–893. [Google Scholar]
- Dadgar, S.M.H.; Araghi, M.S.; Farahani, M.M. A novel text mining approach based on TF-IDF and Support Vector Machine for news classification. In Proceedings of the 2016 IEEE International Conference on Engineering and Technology, Coimbatore, India, 17–18 March 2016; pp. 112–116. [Google Scholar]
- Bondielli, A.; Marcelloni, F. A survey on fake news and rumour detection techniques. Inf. Sci. 2019, 38–55. [Google Scholar] [CrossRef]
- Kusumaningrum, R.; Wiedjayanto, M.I.A.; Adhy, S. Classification of Indonesian news articles based on Latent Dirichlet Allocation. In Proceedings of the 2016 International Conference Data and Software Engineering, Denpasar, Indonesia, 26–27 October 2016; pp. 1–5. [Google Scholar]
- Shehab, M.A.; Badarneh, O.; Al-Ayyoub, M.; Jararweh, Y. A supervised approach for multi-label classification of Arabic news articles. In Proceedings of the 2016 7th International Conference Computer Science and Information Technology, Amman, Jordan, 13–14 July 2016; pp. 1–6. [Google Scholar]
- Van, T.P.; Thanh, T.M. Vietnamese news classification based on BoW with keywords extraction and neural network. In Proceedings of the 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems, Hanoi, Vietnam, 15–17 November 2017; pp. 43–48. [Google Scholar]
- Wang, M.; Cai, Q.; Wang, L.; Li, J.; Wang, X. Chinese news text classification based on attention-based CNN-BiLSTM. In Proceedings of the MIPPR 2019: Pattern Recognition and Computer Vision, Wuhan, China, 2–3 November 2019. [Google Scholar]
- Pazos-Rangel, R.A.; Florencia-Juarez, R.; Paredes-Valverde, M.A.; Rivera, G. Handbook of Research on Natural Language Processing and Smart Service Systems; IGI Global: Hershey, PA, USA, 2017. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Müller, A.C.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. Presented at the European Conference Machine Learning and Principles and Practices of Knowledge Discovery in Databases. arXiv, 2013; arXiv:1309.0238. [Google Scholar]
- SpaCy. Industrial-Strength Natural Language Processing in Python. Available online: https://spacy.io (accessed on 1 September 2020).
Metric | Formula * | Description
---|---|---
Sensitivity or recall | TP / (TP + FN) | It is the fraction of positive patterns that are correctly classified. Under the presence of imbalanced classes, recall typically measures the coverage of the minority class.
Precision | TP / (TP + FP) | It evaluates the proportion of correctly classified instances among the ones classified as positive. For imbalanced classification, it often calculates the accuracy of the minority class.
Specificity | TN / (TN + FP) | It measures the fraction of negative patterns that are correctly classified. It is especially relevant when false positives are highly costly.
F-measure or F1-score | 2 · Precision · Recall / (Precision + Recall) | It is the harmonic mean of precision and recall. The F1-score is the most often used threshold metric for learning from imbalanced data.

* TP: true positives; FP: false positives; TN: true negatives; FN: false negatives.
Table 2. Performance of the classifiers on the imbalanced corpus over the 10 cross-validation folds.

Classifier | Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average
---|---|---|---|---|---|---|---|---|---|---|---|---
CART | Recall | 0.640 | 0.772 | 0.678 | 0.722 | 0.615 | 0.723 | 0.707 | 0.611 | 0.668 | 0.652 | 0.679
 | Precision | 0.676 | 0.768 | 0.671 | 0.712 | 0.657 | 0.665 | 0.692 | 0.707 | 0.658 | 0.663 | 0.687
 | F measure | 0.657 | 0.770 | 0.675 | 0.717 | 0.635 | 0.693 | 0.699 | 0.655 | 0.663 | 0.657 | 0.682
 | ROC-AUC | 0.803 | 0.873 | 0.818 | 0.842 | 0.789 | 0.841 | 0.836 | 0.791 | 0.815 | 0.808 | 0.822
 | PR-AUC | 0.676 | 0.781 | 0.693 | 0.733 | 0.656 | 0.708 | 0.713 | 0.679 | 0.680 | 0.674 | 0.699
Compl. Naïve Bayes | Recall | 0.231 | 0.272 | 0.229 | 0.231 | 0.230 | 0.277 | 0.287 | 0.200 | 0.251 | 0.242 | 0.245
 | Precision | 0.977 | 0.980 | 1.000 | 0.980 | 0.956 | 0.927 | 1.000 | 1.000 | 0.979 | 0.956 | 0.976
 | F measure | 0.374 | 0.426 | 0.373 | 0.374 | 0.371 | 0.427 | 0.446 | 0.333 | 0.400 | 0.386 | 0.391
 | ROC-AUC | 0.929 | 0.961 | 0.941 | 0.942 | 0.932 | 0.925 | 0.941 | 0.930 | 0.946 | 0.936 | 0.938
 | PR-AUC | 0.759 | 0.831 | 0.799 | 0.813 | 0.748 | 0.731 | 0.812 | 0.774 | 0.802 | 0.751 | 0.782
kNN | Recall | 0.091 | 0.130 | 0.137 | 0.061 | 0.091 | 0.038 | 0.083 | 0.221 | 0.064 | 0.101 | 0.102
 | Precision | 1.000 | 0.889 | 0.966 | 1.000 | 0.895 | 0.700 | 0.882 | 0.955 | 0.923 | 0.947 | 0.916
 | F measure | 0.167 | 0.227 | 0.239 | 0.116 | 0.165 | 0.072 | 0.152 | 0.359 | 0.120 | 0.183 | 0.180
 | ROC-AUC | 0.927 | 0.952 | 0.915 | 0.950 | 0.931 | 0.905 | 0.928 | 0.929 | 0.919 | 0.928 | 0.928
 | PR-AUC | 0.797 | 0.849 | 0.813 | 0.873 | 0.769 | 0.719 | 0.786 | 0.826 | 0.776 | 0.804 | 0.801
Random Forest | Recall | 0.559 | 0.663 | 0.556 | 0.580 | 0.572 | 0.582 | 0.569 | 0.532 | 0.572 | 0.579 | 0.576
 | Precision | 0.920 | 0.946 | 0.919 | 0.976 | 0.892 | 0.892 | 0.963 | 0.935 | 0.964 | 0.936 | 0.934
 | F measure | 0.696 | 0.780 | 0.693 | 0.728 | 0.697 | 0.704 | 0.715 | 0.678 | 0.718 | 0.715 | 0.712
 | ROC-AUC | 0.967 | 0.979 | 0.961 | 0.971 | 0.950 | 0.951 | 0.968 | 0.963 | 0.964 | 0.963 | 0.964
 | PR-AUC | 0.849 | 0.896 | 0.855 | 0.885 | 0.813 | 0.814 | 0.878 | 0.832 | 0.870 | 0.848 | 0.854
SVM | Recall | 0.575 | 0.668 | 0.522 | 0.547 | 0.561 | 0.560 | 0.547 | 0.511 | 0.604 | 0.562 | 0.566
 | Precision | 0.955 | 0.939 | 0.973 | 0.943 | 0.938 | 0.904 | 0.952 | 0.960 | 0.950 | 0.943 | 0.946
 | F measure | 0.718 | 0.781 | 0.679 | 0.693 | 0.702 | 0.691 | 0.695 | 0.667 | 0.739 | 0.704 | 0.707
 | ROC-AUC | 0.963 | 0.979 | 0.949 | 0.976 | 0.947 | 0.947 | 0.967 | 0.954 | 0.955 | 0.962 | 0.960
 | PR-AUC | 0.831 | 0.903 | 0.867 | 0.890 | 0.827 | 0.785 | 0.872 | 0.842 | 0.865 | 0.863 | 0.855
Table 3. Impact of the sampling algorithms on SVM and Random Forest (recall, precision, and F-measure are threshold metrics; ROC-AUC and PR-AUC are ranking metrics).

Classifier | Sampling Algorithm | Recall | Precision | F-Measure | ROC-AUC | PR-AUC
---|---|---|---|---|---|---
SVM | None (base version) | 0.5658 | 0.9456 *,§ | 0.7069 | 0.9597 | 0.8545
 | SMOTE | 0.8278 | 0.7511 | 0.7873 † | 0.9681 | 0.8485
 | Borderline SMOTE | 0.8195 | 0.7671 | 0.7921 *,§ | 0.9661 | 0.8515
 | ADASYN | 0.8332 | 0.7376 | 0.7821 | 0.9687 † | 0.8490
 | ROS | 0.8558 | 0.7217 | 0.7828 | 0.9693 *,§ | 0.8561 *,§
 | RUS | 0.8596 †,§ | 0.7074 | 0.7752 | 0.9693 *,§ | 0.8550
Random Forest | None (base version) | 0.5765 | 0.9355 †,§ | 0.7129 | 0.9627 | 0.8552 †,§
 | SMOTE | 0.7114 | 0.8820 | 0.7872 § | 0.9639 | 0.8481
 | Borderline SMOTE | 0.7041 | 0.8917 | 0.7865 | 0.9624 | 0.8503
 | ADASYN | 0.7151 | 0.8749 | 0.7866 | 0.9633 | 0.8495
 | ROS | 0.6759 | 0.9050 | 0.7730 | 0.9647 § | 0.8429
 | RUS | 0.8641 *,§ | 0.5523 | 0.6736 | 0.9621 | 0.8428