
Application of Machine Learning in Text Mining

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 December 2023) | Viewed by 54028

Special Issue Editors


Prof. Dr. Jae-Hoon Kim
Guest Editor
Computer Engineering, Korea Maritime and Ocean University, Busan 49112, Korea
Interests: natural language processing; sentiment analysis; information retrieval; information extraction; machine translation

Prof. Dr. Kichun Lee
Guest Editor
Industrial Engineering, Hanyang University, Seoul 04763, Republic of Korea
Interests: data mining; machine learning; time series analysis; anomaly detection; process mining

Special Issue Information

Dear Colleagues,

According to Wikipedia, “Text mining, also referred to as text data mining, similar to text analytics, is the process of deriving high-quality information from text.” In other words, text mining is the automatic extraction of knowledge from unstructured text (web pages, books, emails, reviews and other social media micro-texts, clinical records, lyrics, and so on) using natural language processing techniques. The process comprises many subtasks, such as text classification, text clustering, text summarization, text visualization, information retrieval, information extraction, and word and/or document embeddings. Text mining can be broadly applied in areas such as economics, education, academic research, government, marketing, and business, and it has a wide range of applications in patent analysis, copyright analysis, internet security, text classification for news articles, bioinformatics, anti-spam filtering, lyric text mining, advertising funnel analysis, sentiment analysis (product reviews, customer surveys, movie reviews, polls, etc.), and more. The topics of interest for this Special Issue include, but are not limited to, the following:

  • Information retrieval;
  • Information extraction;
  • Relation extraction;
  • Named-entity recognition;
  • Sentiment analysis;
  • Text categorization;
  • Text clustering;
  • Text summarization;
  • Fake news detection;
  • Topic detection;
  • Trend detection;
  • Topic tracking;
  • Language detection;
  • Intent detection;
  • Keyword extraction.

We invite the submission of both original research and review articles. Additionally, invited papers based on excellent contributions to recent conferences in this field will be included in this Special Issue. We hope that this collection of high-quality works in text mining will serve as an inspiration for future research in the field.

Prof. Dr. Jae-Hoon Kim
Prof. Dr. Kichun Lee
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • information retrieval
  • information extraction
  • relation extraction
  • named-entity recognition
  • sentiment analysis
  • text categorization
  • text clustering
  • text summarization
  • fake news detection
  • topic detection
  • trend detection
  • topic tracking
  • language detection
  • intent detection
  • keyword extraction

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (21 papers)


Research

18 pages, 2189 KiB  
Article
Recognizing Textual Inference in Mongolian Bar Exam Questions
by Garmaabazar Khaltarkhuu, Biligsaikhan Batjargal and Akira Maeda
Appl. Sci. 2024, 14(3), 1073; https://doi.org/10.3390/app14031073 - 26 Jan 2024
Viewed by 1072
Abstract
This paper examines how to apply deep learning techniques to Mongolian bar exam questions. Several approaches that utilize eight different fine-tuned transformer models were demonstrated for recognizing textual inference in Mongolian bar exam questions. Among eight different models, the fine-tuned bert-base-multilingual-cased obtained the best accuracy of 0.7619. The fine-tuned bert-base-multilingual-cased was capable of recognizing “contradiction”, with a recall of 0.7857 and an F1 score of 0.7674; it recognized “entailment” with a precision of 0.7750, a recall of 0.7381, and an F1 score of 0.7561. Moreover, the fine-tuned bert-large-mongolian-uncased showed balanced performance in recognizing textual inference in Mongolian bar exam questions, thus achieving a precision of 0.7561, a recall of 0.7381, and an F1 score of 0.7470 for recognizing “contradiction”. Full article
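
As a rough illustration of the kind of fine-tuning pipeline the abstract describes (and not the authors' actual code), the sketch below fine-tunes bert-base-multilingual-cased for two-way textual inference with Hugging Face Transformers; the toy statement pairs and the label convention are assumptions standing in for the Mongolian bar exam data.

```python
# Minimal sketch: fine-tuning bert-base-multilingual-cased for two-way textual
# inference (entailment vs. contradiction). Toy examples stand in for the real data.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

data = Dataset.from_dict({
    "premise":    ["Statement quoted from an exam question ...", "Another statement ..."],
    "hypothesis": ["A legal assertion to verify ...", "Another assertion ..."],
    "label":      [1, 0],                       # 1 = entailment, 0 = contradiction (assumed)
})

def encode(batch):
    # Statement pairs are packed into one input; the tokenizer adds [CLS]/[SEP] itself.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

data = data.map(encode, batched=True)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mn-nli", num_train_epochs=1,
                                         per_device_train_batch_size=2),
                  train_dataset=data)
trainer.train()
```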

17 pages, 923 KiB  
Article
Ontology Attention Layer for Medical Named Entity Recognition
by Yue Zha, Yuanzhi Ke, Xiao Hu and Caiquan Xiong
Appl. Sci. 2024, 14(1), 421; https://doi.org/10.3390/app14010421 - 3 Jan 2024
Viewed by 1809
Abstract
Named entity recognition (NER) is particularly challenging for medical texts due to the high domain specificity, abundance of technical terms, and sparsity of data in this field. In this work, we propose a novel attention layer, called the “ontology attention layer”, that enhances the NER performance of a language model for clinical text by utilizing an ontology consisting of conceptual classes related to the target entity set. The proposed layer computes the relevance between each input token and the classes in the ontology and then fuses the encoded token vectors and the class vectors to enhance the token vectors by explicit superior knowledge. In our experiments, we apply the proposed layer to various language models for an NER task based on a Chinese clinical dataset to evaluate the performance of the layer. We also investigate the influence of the granularity of the classes utilized in the ontology attention layer. The experimental results show that the proposed ontology attention layer improved F1 scores by 0.4% to 0.5%. The results suggest that the proposed method is an effective approach to improving the NER performance of existing language models for clinical datasets. Full article
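
A hedged PyTorch sketch of the general idea, under the assumption that each ontology class is represented by a learnable vector: token-to-class relevance is computed with scaled dot-product attention and the resulting class context is fused back into the token vectors. The dimensions and the fusion rule are illustrative, not the paper's exact design.

```python
# Illustrative "ontology attention" style fusion layer (assumed design, not the paper's).
import torch
import torch.nn as nn

class OntologyAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.class_vectors = nn.Parameter(torch.randn(num_classes, hidden_dim))
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, token_states):             # (batch, seq_len, hidden_dim)
        # Relevance of every token to every ontology class (scaled dot product).
        scores = token_states @ self.class_vectors.T / token_states.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=-1)   # (batch, seq_len, num_classes)
        # Class knowledge for each token as a weighted sum of class vectors.
        class_context = weights @ self.class_vectors
        # Fuse the original token vector with the class context.
        return self.fuse(torch.cat([token_states, class_context], dim=-1))

layer = OntologyAttention(hidden_dim=768, num_classes=12)
enhanced = layer(torch.randn(2, 50, 768))         # ready for a downstream NER head
```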

23 pages, 637 KiB  
Article
Unsupervised Domain Adaptation via Weighted Sequential Discriminative Feature Learning for Sentiment Analysis
by Haidi Badr, Nayer Wanas and Magda Fayek
Appl. Sci. 2024, 14(1), 406; https://doi.org/10.3390/app14010406 - 1 Jan 2024
Cited by 1 | Viewed by 1670
Abstract
Unsupervised domain adaptation (UDA) presents a significant challenge in sentiment analysis, especially when faced with differences between source and target domains. This study introduces Weighted Sequential Unsupervised Domain Adaptation (WS-UDA), a novel sequential framework aimed at discovering more profound features and improving target representations, even in resource-limited scenarios. WS-UDA utilizes a domain-adversarial learning model for sequential discriminative feature learning. While recent UDA techniques excel in scenarios where source and target domains are closely related, they struggle with substantial dissimilarities. This potentially leads to instability during shared-feature learning. To tackle this issue, WS-UDA employs a two-stage transfer process concurrently, significantly enhancing model stability and adaptability. The sequential approach of WS-UDA facilitates superior adaptability to varying levels of dissimilarity between source and target domains. Experimental results on benchmark datasets, including Amazon reviews, FDU-MTL datasets, and Spam datasets, demonstrate the promising performance of WS-UDA. It outperforms state-of-the-art cross-domain unsupervised baselines, showcasing its efficacy in scenarios with dissimilar domains. WS-UDA’s adaptability extends beyond sentiment analysis, making it a versatile solution for diverse text classification tasks. Full article
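
WS-UDA's weighted two-stage scheme cannot be reconstructed from the abstract alone, but the domain-adversarial building block it relies on can be sketched with the standard gradient-reversal trick (DANN-style). Everything below is a generic illustration with assumed layer sizes.

```python
# Generic domain-adversarial sketch: gradients from the domain discriminator are
# reversed before reaching the shared feature extractor.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
sentiment_head = nn.Linear(128, 2)       # trained on labeled source reviews
domain_head = nn.Linear(128, 2)          # source vs. target discriminator

x = torch.randn(8, 300)                  # e.g., averaged review embeddings (toy input)
features = feature_extractor(x)
sentiment_logits = sentiment_head(features)
domain_logits = domain_head(GradReverse.apply(features, 1.0))
```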

14 pages, 417 KiB  
Article
TChecker: A Content Enrichment Approach for Fake News Detection on Social Media
by Nada GabAllah, Hossam Sharara and Ahmed Rafea
Appl. Sci. 2023, 13(24), 13070; https://doi.org/10.3390/app132413070 - 7 Dec 2023
Viewed by 1675
Abstract
The spread of fake news on social media continues to be one of the main challenges facing internet users, prohibiting them from discerning authentic from fabricated pieces of information. Hence, identifying the veracity of the content in social posts becomes an important challenge, especially with more people continuing to use social media as their main channel for news consumption. Although a number of machine learning models were proposed in the literature to tackle this challenge, the majority rely on the textual content of the post to identify its veracity, which poses a limitation to the performance of such models, especially on platforms where the content of the users’ post is limited (e.g., Twitter, where each post is limited to 140 characters). In this paper, we propose a deep-learning approach for tackling the fake news detection problem that incorporates the content of both the social post and the associated news article as well as the context of the social post, coined TChecker. Throughout the experiments, we use the benchmark dataset FakeNewsNet to illustrate that our proposed model (TChecker) is able to achieve higher performance across all metrics against a number of baseline models that utilize the social content only as well as models combining both social and news content. Full article
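
A minimal sketch of the content-enrichment idea, assuming a generic BERT checkpoint: the social post and its linked news article are encoded together as a text pair for a binary real/fake classifier. TChecker's full architecture, which also models the post's social context, is not reproduced here.

```python
# Hedged sketch: classify a (post, article) pair with a transformer; weights are
# untrained here and would need fine-tuning on FakeNewsNet labels.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)  # real / fake

post = "Breaking: celebrity X endorses miracle cure!"            # hypothetical example
article = "Full text of the news article shared with the post ..."

inputs = tokenizer(post, article, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)
```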

13 pages, 1576 KiB  
Article
Chinese Named Entity Recognition in Football Based on ALBERT-BiLSTM Model
by Qi An, Bingyu Pan, Zhitong Liu, Shutong Du and Yixiong Cui
Appl. Sci. 2023, 13(19), 10814; https://doi.org/10.3390/app131910814 - 28 Sep 2023
Cited by 4 | Viewed by 1459
Abstract
Football is one of the most popular sports in the world, arousing a wide range of research topics related to its off- and on-the-pitch performance. The extraction of football entities from football news helps to construct sports frameworks, integrate sports resources, and capture the dynamics of the sport in a timely manner through visual text mining results, including the connections among football players, football clubs, and football competitions, and it is of great convenience for observing and analyzing the developmental tendencies of football. Therefore, in this paper, we constructed a 1,000,000-word Chinese corpus in the field of football and proposed a BiLSTM-based model for named entity recognition. The ALBERT-BiLSTM combination model of deep learning is used for entity extraction from football textual data. Based on the BiLSTM model, we introduced ALBERT as a pre-training model to extract character features and enhance the generalization ability of the word embedding vectors. We then compared the results of two different annotation schemes, BIO and BIOE, and two deep learning models, ALBERT-BiLSTM-CRF and ALBERT-BiLSTM. It was verified that BIOE tagging was superior to BIO and that the ALBERT-BiLSTM model was more suitable for football datasets. The precision, recall, and F-score of the model were 85.4%, 83.47%, and 84.37%, respectively. Full article
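
The BIO/BIOE comparison mentioned above can be illustrated with a small tag-conversion helper; one common convention is assumed here (the last token of a multi-token entity receives an E- tag), and the ALBERT-BiLSTM model itself is not reproduced.

```python
# Convert BIO tags to BIOE under an assumed convention: the final I- token of an
# entity becomes E-; single-token entities keep their B- tag.
def bio_to_bioe(tags):
    bioe = list(tags)
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("I") and not nxt.startswith("I"):
            bioe[i] = "E" + tag[1:]
    return bioe

# Toy character-level example: "梅西效力于巴萨" (Messi plays for Barça).
bio = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_bioe(bio))   # ['B-PER', 'E-PER', 'O', 'O', 'O', 'B-ORG', 'E-ORG']
```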

15 pages, 444 KiB  
Article
Integrating Text Classification into Topic Discovery Using Semantic Embedding Models
by Ana Laura Lezama-Sánchez, Mireya Tovar Vidal and José A. Reyes-Ortiz
Appl. Sci. 2023, 13(17), 9857; https://doi.org/10.3390/app13179857 - 31 Aug 2023
Cited by 1 | Viewed by 1879
Abstract
Topic discovery involves identifying the main ideas within large volumes of textual data. It indicates recurring topics in documents, providing an overview of the text. Current topic discovery models receive the text, with or without pre-processing, including stop word removal, text cleaning, and normalization (lowercase conversion). A topic discovery process that receives general-domain text, with or without processing, generates general topics. General topics do not offer detailed overviews of the input text, and manual text categorization is tedious and time-consuming. Extracting topics from text with an automatic classification task is necessary to generate specific topics enriched with top words that maintain semantic relationships among them. Therefore, this paper presents an approach that integrates text classification into topic discovery from large amounts of English textual data, such as the 20-Newsgroups and Reuters corpora. We rely on integrating automatic text classification before the topic discovery process to obtain specific topics for each class with relevant semantic relationships between top words. Text classification analyzes the words that make up a document to decide which class or category the document belongs to; the proposed integration then provides latent and specific topics, depicted by top words with high coherence, for each obtained class. Text classification is accomplished with a convolutional neural network (CNN) that incorporates an embedding model based on semantic relationships. Topic discovery over the categorized text is realized with the latent Dirichlet allocation (LDA), probabilistic latent semantic analysis (PLSA), and latent semantic analysis (LSA) algorithms. An evaluation of topic discovery over categorized text was performed based on the normalized topic coherence metric. The 20-Newsgroups corpus was classified, and twenty topics with the ten top words were identified for each class. The normalized topic coherence obtained was 0.1723 with LDA, 0.1622 with LSA, and 0.1716 with PLSA. The Reuters Corpus was also classified, and twenty and fifty topics were identified. A normalized topic coherence of 0.1441 was achieved when applying the LDA algorithm, obtaining 20 topics for each class; with LSA, the coherence was 0.1360, and with PLSA, it was 0.1436. Full article
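
A hedged sketch of the "classify first, then discover topics per class" pipeline using gensim LDA with NPMI-based coherence. The CNN classifier is omitted, `docs_by_class` is an assumed toy structure, and meaningful coherence values require real corpora such as 20-Newsgroups or Reuters.

```python
# Per-class LDA with normalized (NPMI) topic coherence; toy documents only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs_by_class = {
    "sci.space": [["orbit", "launch", "nasa", "satellite"], ["rocket", "engine", "fuel"]],
    "rec.sport": [["match", "goal", "team", "season"], ["coach", "league", "score"]],
}

for label, texts in docs_by_class.items():
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   random_state=0, passes=10)
    # c_npmi is only meaningful on a real corpus; this just shows the API.
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_npmi").get_coherence()
    print(label, lda.show_topics(num_words=5), coherence)
```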

14 pages, 325 KiB  
Article
Improving Abstractive Dialogue Summarization Using Keyword Extraction
by Chongjae Yoo and Hwanhee Lee
Appl. Sci. 2023, 13(17), 9771; https://doi.org/10.3390/app13179771 - 29 Aug 2023
Cited by 2 | Viewed by 1964
Abstract
Abstractive dialogue summarization aims to generate a short passage that contains the important content of a particular dialogue spoken by multiple speakers. In abstractive dialogue summarization systems, capturing the subject of the dialogue is challenging owing to the properties of colloquial texts. Moreover, such systems often generate uninformative summaries. In this paper, we propose a novel keyword-aware dialogue summarization system (KADS) that easily captures the subject of the dialogue to alleviate the problem mentioned above through the efficient use of keywords. Specifically, we first extract the keywords from the input dialogue using a pre-trained keyword extractor. Subsequently, KADS efficiently supplies the keyword information of the dialogue to the transformer-based summarization system through the pre-trained keyword extractor. Extensive experiments performed on three benchmark datasets show that the proposed method outperforms the baseline system. Additionally, we demonstrate that the proposed keyword-aware dialogue summarization system exhibits a high performance gain in low-resource conditions where the number of training examples is highly limited. Full article
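
A simple illustration (not KADS itself) of steering a summarizer with keywords: a TF-IDF extractor and a generic BART checkpoint stand in for the paper's pre-trained components, and the prompt format is an assumption.

```python
# Extract a few keywords and prepend them to the summarizer input to steer the model
# toward the dialogue subject; components below are stand-ins, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

dialogue = ("Amanda: I baked cookies. Do you want some? "
            "Jerry: Sure! Amanda: I'll bring you some tomorrow.")

# Pick the top-scoring unigrams as keywords.
vec = TfidfVectorizer(stop_words="english")
scores = vec.fit_transform([dialogue]).toarray()[0]
keywords = [w for w, _ in sorted(zip(vec.get_feature_names_out(), scores),
                                 key=lambda p: -p[1])[:3]]

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
augmented = "keywords: " + ", ".join(keywords) + " | dialogue: " + dialogue
print(summarizer(augmented, max_length=40, min_length=5)[0]["summary_text"])
```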

18 pages, 2973 KiB  
Article
Sentiment Analysis of Semantically Interoperable Social Media Platforms Using Computational Intelligence Techniques
by Ali Alqahtani, Surbhi Bhatia Khan, Jarallah Alqahtani, Sultan AlYami and Fayez Alfayez
Appl. Sci. 2023, 13(13), 7599; https://doi.org/10.3390/app13137599 - 27 Jun 2023
Cited by 1 | Viewed by 1715
Abstract
Competitive intelligence in social media analytics has significantly influenced behavioral finance worldwide in recent years; it is continuously emerging with a high growth rate of unpredicted variables per week. Several surveys in this large field have shown how social media involvement has created a trackless network using machine learning techniques through web applications and Android modes using interoperability. This article proposes an improved social media sentiment analytics technique to predict the individual state of mind of social media users and the ability of users to resist profound effects. The proposed estimation function tracks the counts of the aversion and satisfaction levels of each inter- and intra-linked expression. It tracks down more than one ontologically linked activity from different social media platforms with a high average success rate of 99.71%. The accuracy of the proposed solution is a satisfactory 97%, which could be effectively considered in various industrial solutions such as emo-robot building, patient analysis and activity tracking, elderly care, and so on. Full article

17 pages, 11057 KiB  
Article
“Standard Text” Relational Classification Model Based on Concatenated Word Vector Attention and Feature Concatenation
by Xize Liu, Jiakai Tian, Nana Niu, Jingsheng Li and Jiajia Han
Appl. Sci. 2023, 13(12), 7119; https://doi.org/10.3390/app13127119 - 14 Jun 2023
Cited by 2 | Viewed by 1150
Abstract
The task of relation classification is an important pre-task in natural language processing. Relation classification can provide a high-quality corpus for tasks such as machine translation, human–computer dialogue, and structured text generation. In the process of the digitalization of standards, identifying the entity relationships in standard texts is an important prerequisite for the formation of subsequent standard knowledge. Only by accurately labeling the relationships between entities can the subsequent formation of knowledge bases and knowledge maps achieve higher efficiency and accuracy. This study proposes a standard-text relation classification model based on concatenated word vector attention and feature concatenation. Comparison and ablation experiments were carried out on our labeled Chinese standard-text dataset. At the same time, in order to demonstrate the performance of the model, the above experiments were also carried out on two general English datasets, SemEval-2010 Task 8 and KBP37. On both the standard-text dataset and the general datasets, the proposed model achieved excellent results. Full article

17 pages, 2009 KiB  
Article
Hyperparameter Optimization of Ensemble Models for Spam Email Detection
by Temidayo Oluwatosin Omotehinwa and David Opeoluwa Oyewola
Appl. Sci. 2023, 13(3), 1971; https://doi.org/10.3390/app13031971 - 3 Feb 2023
Cited by 17 | Viewed by 5073
Abstract
Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. More than half of the emails sent in 2021 were spam, resulting in huge financial losses. The tenacity and perpetual presence of the adversary, the spammer, have necessitated improved efforts at filtering spam. This study therefore developed baseline models of the random forest and extreme gradient boosting (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset. The developed ensemble models were then optimized using the grid-search cross-validation technique to search the hyperparameter space for optimal hyperparameter values. The performance of the baseline (un-tuned) and tuned models of both algorithms was evaluated and compared, and the impact of hyperparameter tuning on both models was examined. The findings of the experimental study revealed that hyperparameter tuning improved the performance of both models when compared with the baseline models. The tuned RF and XGBoost models achieved an accuracy of 97.78% and 98.09%, a sensitivity of 98.44% and 98.84%, and an F1 score of 97.85% and 98.16%, respectively. The XGBoost model outperformed the random forest model, and the developed XGBoost model is effective and efficient for spam email detection. Full article
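
A compact sketch of the tuning setup described above: grid-search cross-validation over small illustrative grids for random forest and XGBoost classifiers on TF-IDF email features. The Enron1 loading step and the full search space used in the paper are not reproduced.

```python
# Grid-search cross-validation for RF and XGBoost spam classifiers (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

emails = ["win a free prize now", "meeting moved to 3pm",
          "cheap meds online", "see agenda attached"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = ham
X = TfidfVectorizer().fit_transform(emails)

searches = {
    "rf":  GridSearchCV(RandomForestClassifier(random_state=0),
                        {"n_estimators": [100, 300], "max_depth": [None, 20]}, cv=2),
    "xgb": GridSearchCV(XGBClassifier(random_state=0),
                        {"n_estimators": [100, 300], "learning_rate": [0.05, 0.3]}, cv=2),
}
for name, gs in searches.items():
    gs.fit(X, labels)
    print(name, gs.best_params_, round(gs.best_score_, 3))
```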

16 pages, 2457 KiB  
Article
Crowd Control, Planning, and Prediction Using Sentiment Analysis: An Alert System for City Authorities
by Tariq Malik, Najma Hanif, Ahsen Tahir, Safeer Abbas, Muhammad Shoaib Hanif, Faiza Tariq, Shuja Ansari, Qammer Hussain Abbasi and Muhammad Ali Imran
Appl. Sci. 2023, 13(3), 1592; https://doi.org/10.3390/app13031592 - 26 Jan 2023
Cited by 1 | Viewed by 3060
Abstract
Modern means of communication, economic crises, and political decisions play imperative roles in reshaping political and administrative systems throughout the world. Twitter, a micro-blogging website, has gained paramount importance in terms of public opinion-sharing. Manual intelligence gathering by law enforcement agencies cannot cope with such rapidly changing situations in real time. Thus, to address this problem, we built an alert system for government authorities in the province of Punjab, Pakistan. The alert system gathers real-time data from Twitter in English and Roman Urdu about forthcoming gatherings (protests, demonstrations, assemblies, rallies, sit-ins, marches, etc.). To determine public sentiment regarding upcoming anti-government gatherings, the alert system determines the polarity of tweets. Using keywords, the system provides information about future gatherings by extracting entities such as the date, time, and location from Twitter data obtained in real time. Our system was trained and tested with different machine learning (ML) algorithms, such as random forest (RF), decision tree (DT), support vector machine (SVM), multinomial naïve Bayes (MNB), and Gaussian naïve Bayes (GNB), along with two vectorization techniques, i.e., term frequency–inverse document frequency (TF-IDF) and count vectorization. Moreover, this paper compares the accuracy of sentiment analysis (SA) of Twitter data obtained by applying these supervised ML algorithms. In our experiments, we used two data sets, i.e., a small data set of 1000 tweets and a large data set of 4000 tweets. Results showed that RF along with count vectorization performed best for the small data set with an accuracy of 82%; for the large data set, MNB along with count vectorization outperformed all other classifiers with an accuracy of 75%. Additionally, language models, e.g., bigram and trigram, were used to generate word clouds of positive and negative words to visualize the most frequently used words. Full article
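
The vectorizer/classifier comparison can be sketched with scikit-learn pipelines as below; the toy tweets and polarity labels are placeholders for the labeled English/Roman-Urdu data, and only two of the five classifiers are shown.

```python
# Compare count vs. TF-IDF features crossed with MNB and RF, scored by cross-validation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

tweets = ["jalsa kal mall road par hoga", "great day at the park",
          "protest march on Friday near assembly", "enjoying the cricket match"]
polarity = [1, 0, 1, 0]                      # 1 = negative, 0 = positive (toy labels)

for vec in (CountVectorizer(), TfidfVectorizer()):
    for clf in (MultinomialNB(), RandomForestClassifier(random_state=0)):
        pipe = make_pipeline(vec, clf)
        acc = cross_val_score(pipe, tweets, polarity, cv=2).mean()
        print(type(vec).__name__, type(clf).__name__, round(acc, 2))
```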

19 pages, 2598 KiB  
Article
Integration of Sentiment Analysis of Social Media in the Strategic Planning Process to Generate the Balanced Scorecard
by José Roberto Grande-Ramírez, Eduardo Roldán-Reyes, Alberto A. Aguilar-Lasserre and Ulises Juárez-Martínez
Appl. Sci. 2022, 12(23), 12307; https://doi.org/10.3390/app122312307 - 1 Dec 2022
Cited by 4 | Viewed by 3453
Abstract
Strategic planning (SP) requires attention and constant updating and is a crucial process for guaranteeing the efficient performance of companies. This article proposes a novel approach, applied in a case study, whereby a balanced scorecard (BSC) was generated that integrated sentiment analysis (SA) of social media (SM), taking advantage of the valuable knowledge in these sources. In this study, opinions were consolidated in the main dataset to incorporate sentiments regarding the strategic part of a restaurant in a tourist city. The proposed methodology began with the selection of the company. Information was then acquired to apply pre-processing, processing, evaluation, and validation, which is capitalized in a BSC to support strategic decision-making. The model was implemented in Python and comprised lexicon-based and machine learning approaches for the SA. The significant knowledge in the comments was automatically oriented toward the key performance indicators (KPIs) and perspectives of a BSC that were previously determined by a group of opinion leaders of the company. The methods, techniques, and algorithms of SA and SP showed that unstructured textual information can be processed and capitalized on efficiently for optimal management and decision-making. The results revealed an improvement (reduced effort and time) in producing a more robust and comprehensive BSC with the support and validation of experts. Moreover, new resources and approaches were developed to implement more efficient SP. The model was based on the efficient coupling of both fields of study. Full article

27 pages, 928 KiB  
Article
An Explainable Artificial Intelligence Approach for Detecting Empathy in Textual Communication
by Edwin Carlos Montiel-Vázquez, Jorge Adolfo Ramírez Uresti and Octavio Loyola-González
Appl. Sci. 2022, 12(19), 9407; https://doi.org/10.3390/app12199407 - 20 Sep 2022
Cited by 6 | Viewed by 4026
Abstract
Empathy is a necessary component of human communication. However, it has been largely ignored in favor of other concepts such as emotion and feeling in Affective computing. Research that has been carried out regarding empathy in computer science lacks a method of measuring empathy based on psychological research. Likewise, it does not present an avenue for expanding knowledge regarding this concept. We provide a comprehensive study on the nature of empathy and a method for detecting it in textual communication. We measured empathy present in conversations from a database through volunteers and psychological research. Subsequently, we made use of a pattern-based classification algorithm to predict the Empathy levels in each conversation. Our research contributions are: the Empathy score, a metric for measuring empathy in texts; Empathetic Conversations, a database containing conversations with their respective Empathy score; and our results. We show that an explicative pattern-based approach (PBC4cip) is, to date, the best approach for detecting empathy in texts. This is by measuring performance in both nominal and ordinal metrics. We found a statistically significant difference in performance for our approach and other algorithms with lower performance. In addition, we show the advantages of interpretability by our model in contrast to other approaches. This is one of the first approaches to measuring empathy in texts, and we expect it to be useful for future research. Full article

15 pages, 1817 KiB  
Article
Re-Engineered Word Embeddings for Improved Document-Level Sentiment Analysis
by Su Yang and Farzin Deravi
Appl. Sci. 2022, 12(18), 9287; https://doi.org/10.3390/app12189287 - 16 Sep 2022
Viewed by 2157
Abstract
In this paper, a novel re-engineering mechanism for the generation of word embeddings is proposed for document-level sentiment analysis. Current approaches to sentiment analysis often integrate feature engineering with classification, without optimizing the feature vectors explicitly. Engineering feature vectors to match the data between the training set and query sample as proposed in this paper could be a promising way for boosting the classification performance in machine learning applications. The proposed mechanism is designed to re-engineer the feature components from a set of embedding vectors for greatly increased between-class separation, hence better leveraging the informative content of the documents. The proposed mechanism was evaluated using four public benchmarking datasets for both two-way and five-way semantic classifications. The resulting embeddings have demonstrated substantially improved performance for a range of sentiment analysis tasks. Tests using all the four datasets achieved by far the best classification results compared with the state-of-the-art. Full article

16 pages, 1821 KiB  
Article
Informative Language Encoding by Variational Autoencoders Using Transformer
by Changwon Ok, Geonseok Lee and Kichun Lee
Appl. Sci. 2022, 12(16), 7968; https://doi.org/10.3390/app12167968 - 9 Aug 2022
Cited by 3 | Viewed by 3322
Abstract
In natural language processing (NLP), Transformer is widely used and has reached the state-of-the-art level in numerous NLP tasks such as language modeling, summarization, and classification. Moreover, a variational autoencoder (VAE) is an efficient generative model in representation learning, combining deep learning with statistical inference in encoded representations. However, the use of VAE in natural language processing often brings forth practical difficulties such as a posterior collapse, also known as Kullback–Leibler (KL) vanishing. To mitigate this problem, while taking advantage of the parallelization of language data processing, we propose a new language representation model as the integration of two seemingly different deep learning models, which is a Transformer model solely coupled with a variational autoencoder. We compare the proposed model with previous works, such as a VAE connected with a recurrent neural network (RNN). Our experiments with four real-life datasets show that implementation with KL annealing mitigates posterior collapses. The results also show that the proposed Transformer model outperforms RNN-based models in reconstruction and representation learning, and that the encoded representations of the proposed model are more informative than other tested models. Full article
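
The KL-annealing remedy for posterior collapse can be shown in isolation: the weight on the KL term is ramped up from 0 to 1 during training instead of being fixed. The Transformer encoder/decoder and the exact schedule used in the paper are omitted; the linear warm-up below is one common choice.

```python
# KL annealing for a VAE-style loss: beta grows from 0 to 1 over warmup_steps.
import torch

def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    # Linear annealing; cyclical schedules are a common alternative.
    return min(1.0, step / warmup_steps)

def vae_loss(recon_logits, targets, mu, logvar, step):
    recon = torch.nn.functional.cross_entropy(
        recon_logits.view(-1, recon_logits.size(-1)), targets.view(-1))
    # KL divergence between the approximate posterior N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight(step) * kl

# Toy shapes: batch of 4 sequences, length 7, vocabulary of 100, latent size 16.
loss = vae_loss(torch.randn(4, 7, 100), torch.randint(0, 100, (4, 7)),
                torch.randn(4, 16), torch.randn(4, 16), step=2_500)
print(float(loss))
```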

24 pages, 5111 KiB  
Article
Post-Authorship Attribution Using Regularized Deep Neural Network
by Abiodun Modupe, Turgay Celik, Vukosi Marivate and Oludayo O. Olugbara
Appl. Sci. 2022, 12(15), 7518; https://doi.org/10.3390/app12157518 - 26 Jul 2022
Cited by 6 | Viewed by 3034
Abstract
Post-authorship attribution is a scientific process of using stylometric features to identify the genuine writer of an online text snippet such as an email, blog, forum post, or chat log. It has useful applications in manifold domains, for instance, in a verification process to proactively detect misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks. The process assumes that texts can be characterized by sequences of words that agglutinate the functional and content lyrics of a writer. However, defining an appropriate characterization of text to capture the unique writing style of an author is a complex endeavor in the discipline of computational linguistics. Moreover, posts are typically short texts with obfuscating vocabularies that might impact the accuracy of authorship attribution. The vocabularies include idioms, onomatopoeias, homophones, phonemes, synonyms, acronyms, anaphora, and polysemy. The method of the regularized deep neural network (RDNN) is introduced in this paper to circumvent the intrinsic challenges of post-authorship attribution. It is based on a convolutional neural network, a bidirectional long short-term memory encoder, and a distributed highway network. The convolutional neural network was used to extract lexical stylometric features that are fed into the bidirectional encoder to extract a syntactic feature-vector representation. The feature vector was then supplied as input to the distributed highway network for regularization to minimize the network-generalization error. The regularized feature vector was ultimately passed to the bidirectional decoder to learn the writing style of an author. The feature-classification layer consists of a fully connected network and a softmax function to make the prediction. The RDNN method was tested against thirteen state-of-the-art methods using four benchmark experimental datasets to validate its performance. Experimental results demonstrated the effectiveness of the method when compared to the existing state-of-the-art methods on three datasets, while producing comparable results on one dataset. Full article
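
A rough PyTorch sketch of the named building blocks (CNN feature extractor, BiLSTM encoder, highway layer for regularized feature mixing); the layer sizes and the way they are chained here are assumptions, not the exact RDNN configuration.

```python
# Assumed stack: embeddings -> 1D CNN -> BiLSTM -> highway -> author classifier.
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))        # how much transformed vs. carried input
        return t * h + (1 - t) * x

embed = nn.Embedding(5000, 128)                # vocabulary size is assumed
cnn = nn.Conv1d(128, 128, kernel_size=3, padding=1)
bilstm = nn.LSTM(128, 64, bidirectional=True, batch_first=True)
highway = Highway(128)
classifier = nn.Linear(128, 10)                # 10 candidate authors (assumed)

tokens = torch.randint(0, 5000, (4, 60))       # batch of 4 posts, 60 tokens each
x = torch.relu(cnn(embed(tokens).transpose(1, 2))).transpose(1, 2)
encoded, _ = bilstm(x)                         # (4, 60, 128)
pooled = highway(encoded.mean(dim=1))          # regularized document representation
print(classifier(pooled).shape)                # torch.Size([4, 10])
```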

16 pages, 3673 KiB  
Article
Stepwise Multi-Task Learning Model for Holder Extraction in Aspect-Based Sentiment Analysis
by Ho-Min Park and Jae-Hoon Kim
Appl. Sci. 2022, 12(13), 6777; https://doi.org/10.3390/app12136777 - 4 Jul 2022
Cited by 2 | Viewed by 1620
Abstract
Aspect-based sentiment analysis is a text analysis technique that categorizes data by aspect and identifies the sentiment attributed to each one; it is a task for fine-grained sentiment analysis. In order to accurately perform fine-grained sentiment analysis, a sentiment word within a text, the target it modifies, and the holder who expresses the sentiment word are required; however, they should be extracted in sequence, because the sentiment word is an important clue for extracting the target, which in turn is key evidence for the holder. Namely, the three types of information sequentially become important clues. Therefore, in this paper, we propose a stepwise multi-task learning model for holder extraction with RoBERTa and Bi-LSTM. The tasks are sentiment word extraction, target extraction, and holder extraction. The proposed model was trained and evaluated on the Laptop and Restaurant datasets from SemEval 2014 through 2016. We observed that the performance of the proposed model was improved by using stepwise features that are the output of the previous task. Furthermore, a generalization effect was observed by making the final output format of the model a BIO tagging scheme, which can avoid overfitting to a specific domain of the review text by outputting BIO tags instead of words. Full article

15 pages, 1771 KiB  
Article
Fake Sentence Detection Based on Transfer Learning: Applying to Korean COVID-19 Fake News
by Jeong-Wook Lee and Jae-Hoon Kim
Appl. Sci. 2022, 12(13), 6402; https://doi.org/10.3390/app12136402 - 23 Jun 2022
Cited by 9 | Viewed by 2516
Abstract
With the increasing number of social media users in recent years, news in various fields, such as politics, economics, and so on, can be easily accessed by users. However, most news spread through social networks, including Twitter, Facebook, and Instagram, has unknown sources and thus has a significant impact on news consumers. Fake news on COVID-19, which is affecting the global population, propagates quickly and causes social disorder. Thus, a lot of research is being conducted on the detection of fake news on COVID-19, but it faces the problem of a lack of datasets. In order to alleviate this problem, we built a dataset of COVID-19 fake news from fact-checking websites in Korea and propose a deep learning model for detecting fake news on COVID-19 using this dataset. The proposed model is pre-trained with large-scale data and then performs transfer learning through a BiLSTM model. Moreover, we propose a method for initializing the hidden and cell states of the BiLSTM model with a [CLS] token instead of a zero vector. Through experiments, the proposed model achieved an accuracy of 78.8%, an improvement of 8% over a linear baseline model, confirming that transfer learning can be useful when only a small amount of data is available. A [CLS] token containing sentence information as the initial state of the BiLSTM can contribute to a performance improvement of the model. Full article
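
The [CLS]-initialization trick can be sketched as follows, assuming a multilingual BERT checkpoint as a stand-in for the Korean model used in the paper; the classifier head and the toy fake-news sentence are assumptions.

```python
# Use BERT's [CLS] vector (instead of zeros) as the initial hidden and cell states of a
# BiLSTM fake-sentence classifier. Untrained weights; fine-tuning is still required.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
lstm = nn.LSTM(input_size=768, hidden_size=768, bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 768, 2)             # real vs. fake sentence

inputs = tokenizer("코로나19 백신은 5G 칩을 포함하고 있다.", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
cls = hidden_states[:, 0, :]                            # [CLS] vector

# One copy of [CLS] per direction, used as both h0 and c0.
h0 = c0 = cls.unsqueeze(0).repeat(2, 1, 1)              # (num_directions, batch, 768)
output, _ = lstm(hidden_states, (h0, c0))
logits = classifier(output[:, -1, :])
print(logits)
```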

9 pages, 1704 KiB  
Article
Korean Semantic Role Labeling with Bidirectional Encoder Representations from Transformers and Simple Semantic Information
by Jangseong Bae and Changki Lee
Appl. Sci. 2022, 12(12), 5995; https://doi.org/10.3390/app12125995 - 13 Jun 2022
Cited by 1 | Viewed by 1688
Abstract
State-of-the-art semantic role labeling (SRL) performance has been achieved using neural network models by incorporating syntactic feature information such as dependency trees. In recent years, breakthroughs achieved using end-to-end neural network models have resulted in state-of-the-art SRL performance even without syntactic features. With the advent of a language model called bidirectional encoder representations from transformers (BERT), another breakthrough was witnessed. Even though the semantic information of each word constituting a sentence is important in determining the meaning of a word, previous studies on the end-to-end neural network method did not utilize semantic information. In this study, we propose a BERT-based SRL model that uses simple semantic information without syntactic feature information. To obtain this semantic information, we used PropBank, which describes the relational information between predicates and arguments. In addition, text-originated feature information obtained from the training text data was utilized. Our proposed model achieved state-of-the-art results on both the Korean PropBank and CoNLL-2009 English benchmarks. Full article

16 pages, 4074 KiB  
Article
Citation Context Analysis Using Combined Feature Embedding and Deep Convolutional Neural Network Model
by Musarat Karim, Malik Muhammad Saad Missen, Muhammad Umer, Saima Sadiq, Abdullah Mohamed and Imran Ashraf
Appl. Sci. 2022, 12(6), 3203; https://doi.org/10.3390/app12063203 - 21 Mar 2022
Cited by 18 | Viewed by 3714
Abstract
Citation creates a link between the citing and the cited author, and the frequency of citation has been regarded as the basic element for measuring the impact of research and knowledge-based achievements. Citation frequency has been widely used to calculate the impact factor, H-index, i10-index, etc., of authors and journals. However, for a fair evaluation, the qualitative aspect should be considered along with the quantitative measures. The sentiments expressed in a citation play an important role in evaluating the quality of the research, because a citation may indicate appreciation, criticism, or a basis for carrying on research. In-text citation analysis is a challenging task, despite the use of machine learning models and automatic sentiment annotation. Additionally, the use of deep learning models and word embeddings has not been studied very well. This study performs several experiments with machine learning and deep learning models using fastText, fastText subword, global vectors, and their blending for word representation to perform in-text citation sentiment analysis. A dimensionality reduction technique called principal component analysis (PCA) is utilized to reduce the feature vectors before passing them to the classifier. Additionally, a customized convolutional neural network (CNN) is presented to obtain higher classification accuracy. Results suggest that the deep learning CNN coupled with fastText word embeddings produces the best results in terms of accuracy, precision, recall, and F1 measure. Full article
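
An illustrative version of the described pipeline: word vectors (random stand-ins for fastText embeddings) are reduced with PCA and fed to a small 1D CNN. The shapes and the three sentiment classes are assumptions, not the paper's exact configuration.

```python
# PCA-reduced word vectors for one citation context, classified by a small 1D CNN.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

seq_len, emb_dim, reduced_dim = 40, 300, 50
embeddings = np.random.rand(200, emb_dim)           # stand-in for fastText vectors
pca = PCA(n_components=reduced_dim).fit(embeddings)

sentence = np.random.rand(seq_len, emb_dim)          # one citation context (toy)
x = torch.tensor(pca.transform(sentence), dtype=torch.float32).T.unsqueeze(0)

cnn = nn.Sequential(
    nn.Conv1d(reduced_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
    nn.Linear(64, 3),                                 # positive / neutral / negative
)
print(cnn(x))                                         # logits for the three classes
```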

44 pages, 7017 KiB  
Article
Using Conceptual Recurrence and Consistency Metrics for Topic Segmentation in Debate
by Jaejong Ho, Hyoji Ha, Seok-Won Lee and Kyungwon Lee
Appl. Sci. 2022, 12(6), 2952; https://doi.org/10.3390/app12062952 - 14 Mar 2022
Viewed by 2263
Abstract
We propose a topic segmentation model, CSseg (Conceptual Similarity-segmenter), for debates based on conceptual recurrence and debate consistency metrics. We research whether the conceptual similarity of conceptual recurrence and debate consistency metrics relate to topic segmentation. Conceptual similarity is a similarity between utterances in conceptual recurrence analysis, and debate consistency metrics represent the internal coherence properties that maintain the debate topic in interactions between participants. Based on the research question, CSseg segments transcripts by applying similarity cohesion methods based on conceptual similarities; the topic segmentation is affected by applying weights to conceptual similarities having debate internal consistency properties, including other-continuity, self-continuity, chains of arguments and counterarguments, and the topic guide of moderator. CSseg provides a user-driven topic segmentation by allowing the user to adjust the weights of the similarity cohesion methods and debate consistency metrics. It takes an approach that alleviates the problem whereby each person judges the topic segments differently in debates and multi-party discourse. We implemented the prototype of CSseg by utilizing the Korean TV debate program MBC 100-Minute Debate and analyzed the results by use cases. We compared CSseg and a previous model LCseg (Lexical Cohesion-segmenter) with the evaluation metrics Pk and WD. CSseg had greater performance than LCseg in debates. Full article
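
The Pk and WindowDiff evaluation mentioned above can be reproduced with NLTK's reference implementations on toy boundary strings; lower values are better, and the window size is normally set from the mean segment length of the gold segmentation.

```python
# Pk and WindowDiff on toy segmentations ("1" marks a topic boundary after an utterance).
from nltk.metrics.segmentation import pk, windowdiff

reference  = "0001000010000100"    # gold topic boundaries
hypothesis = "0000100010000010"    # boundaries predicted by a segmenter such as CSseg

k = 3                              # window size, typically half the mean segment length
print("Pk        :", round(pk(reference, hypothesis, k=k), 3))
print("WindowDiff:", round(windowdiff(reference, hypothesis, k), 3))
```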
