Text Mining: Challenges, Algorithms, Tools and Applications

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Processes".

Deadline for manuscript submissions: 15 June 2025 | Viewed by 30262

Special Issue Editor

Dr. Fei Liu
Department of Computer Science & Computer Engineering, La Trobe University, Melbourne, Australia
Interests: sentiment analysis; text summarization; semantic web; logic programming

Special Issue Information

Dear Colleagues,

Text mining has emerged as a prominent field in data mining. From information retrieval, information extraction, and text classification to sentiment analysis and text summarization, text mining plays a significant role in many application fields. In recent years, a variety of mining techniques have been developed, including rule-based and statistical models, support vector machines, clustering, neural networks, and deep learning. In each category, distance and similarity estimation has always been a key issue.

The aim of this Special Issue is to offer an opportunity to publish original research: cutting-edge theories, innovative algorithms, and novel applications. In particular, we welcome manuscripts on text summarization, which is commonly regarded as the most challenging area of text mining. Survey articles describing the state of the art are also welcome.

Topics include, but are not limited to, the following:

  • Information retrieval and extraction;
  • Question-answering systems;
  • Recommendation systems;
  • Security and privacy;
  • Sentiment analysis;
  • Text classification;
  • Text summarization.

Dr. Fei Liu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts are available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • information retrieval
  • information extraction
  • recommendation systems
  • sentiment analysis
  • text summarization
  • text mining
  • machine learning
  • artificial intelligence

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (12 papers)

Research

14 pages, 1006 KiB  
Article
Correlations and Fractality in Sentence-Level Sentiment Analysis Based on VADER for Literary Texts
by Ricardo Hernández-Pérez, Pablo Lara-Martínez, Bibiana Obregón-Quintana, Larry S. Liebovitch and Lev Guzmán-Vargas
Information 2024, 15(11), 698; https://doi.org/10.3390/info15110698 - 4 Nov 2024
Viewed by 554
Abstract
We perform a sentence-level sentiment analysis study of different literary texts in English language. Each text is converted into a series in which the data points are the sentiment value of each sentence obtained using the sentiment analysis tool (VADER). By applying the Detrended Fluctuation Analysis (DFA) and the Higuchi Fractal Dimension (HFD) methods to these sentiment series, we find that they are monofractal with long-term correlations, which can be explained by the fact that the writing process has memory by construction, with a sentiment evolution that is self-similar. Furthermore, we discretize these series by applying a classification approach which transforms the series into a one on which each data point has only three possible values, corresponding to positive, neutral or negative sentiments. We map these three-states series to a Markov chain and investigate the transitions of sentiment from one sentence to the next, obtaining a state transition matrix for each book that provides information on the probability of transitioning between sentiments from one sentence to the next. This approach shows that there are biases towards increasing the probability of switching to neutral or positive sentences. The two approaches supplement each other, since the long-term correlation approach allows a global assessment of the sentiment of the book, while the state transition matrix approach provides local information about the sentiment evolution along the text. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
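
The state-transition step described in this abstract can be sketched in a few lines. This is only an illustration, not the authors' implementation: the scores are made-up stand-ins for per-sentence sentiment values, and the ±0.05 cut-off is an assumed threshold for splitting scores into three states; the abstract does not say which classification rule the paper uses.

```python
from collections import Counter

STATES = ("negative", "neutral", "positive")


def discretize(score, threshold=0.05):
    """Map a sentiment score in [-1, 1] to one of three states.

    The threshold is an assumption made for this sketch.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def transition_matrix(labels):
    """Estimate P(next state | current state) from consecutive sentences."""
    counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES)
        # Rows with no observed transitions are left as all zeros.
        matrix[src] = {
            dst: (counts[(src, dst)] / total if total else 0.0)
            for dst in STATES
        }
    return matrix


# Hypothetical per-sentence sentiment scores for a short text.
scores = [0.6, 0.0, -0.4, 0.02, 0.7, 0.1]
labels = [discretize(s) for s in scores]
matrix = transition_matrix(labels)

for src in STATES:
    print(src, matrix[src])
```

Each row of the resulting matrix is the empirical distribution of the next sentence's sentiment given the current one, which is the local view that the abstract contrasts with the global long-term correlation analysis.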

20 pages, 391 KiB  
Article
Uncovering Tourist Visit Intentions on Social Media through Sentence Transformers
by Paolo Fantozzi, Guglielmo Maccario and Maurizio Naldi
Information 2024, 15(10), 603; https://doi.org/10.3390/info15100603 - 30 Sep 2024
Viewed by 685
Abstract
The problem of understanding and predicting tourist behavior in choosing their destinations is a long-standing one. The first step in the process is to understand users’ intention to visit a country, which may later translate into an actual visit. Would-be tourists may express their intention to visit a destination on social media. Being able to predict their intention may be useful for targeted promotion campaigns. In this paper, we propose an algorithm to predict visit (or revisit) intentions based on the texts in posts on social media. The algorithm relies on a neural network sentence-transformer architecture using optimized embedding and a logistic classifier. Employing two real labeled datasets from Twitter (now X) for training, the algorithm achieved 90% accuracy and balanced performances over the two classes (visit intention vs. no-visit intention). The algorithm was capable of predicting intentions to visit with high accuracy, even when fed with very imbalanced datasets, where the posts showing the intention to visit were an extremely small minority. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

17 pages, 3327 KiB  
Article
Automated Knowledge Extraction in the Field of Wheat Sharp Eyespot Control
by Keyi Liu and Yunpeng Cui
Information 2024, 15(7), 367; https://doi.org/10.3390/info15070367 - 21 Jun 2024
Viewed by 924
Abstract
Wheat sharp eyespot is a soil-borne fungal disease commonly found in wheat areas in China, which can occur throughout the entire reproductive period of wheat and has a great impact on the yield and quality of wheat in China. By constructing a domain ontology for wheat sharp eyespot control and modeling the domain knowledge, we aim to integrate and share the knowledge in the field of wheat sharp eyespot control, which can provide important support and guidance for agricultural decision-making and disease control. In this study, the literature in the field of wheat sharp eyespot control was used as a data source, the KeyBERT keyword extraction algorithm was used to mine the core concepts of the ontology, and the hierarchical relationships among the ontology concepts were extracted through clustering. Based on the constructed ontology of wheat sharp eyespot control, the schema of knowledge extraction was formed, and the knowledge extraction model was trained using the ERNIE 3.0 knowledge enhancement pretraining model. This study proposes a model and algorithm to realize knowledge extraction based on domain ontology, describes the construction method and process framework of wheat sharp eyespot control domain ontology, and details the training and reasoning effect of the knowledge extraction model. The knowledge extraction model constructed in this study for wheat sharp eyespot control contains a more complete conceptual system of wheat sharp eyespot. The F1 value of the model reaches 91.26%, which is a 17.86% improvement compared with the baseline model, and it can satisfy the knowledge extraction needs in the field of wheat sharp eyespot control. This study can provide a reference for domain knowledge extraction and provide strong support for knowledge discovery and downstream applications such as intelligent Q&A and intelligent recommendation in the field of wheat sharp eyespot control. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

30 pages, 1001 KiB  
Article
Genre Classification of Books in Russian with Stylometric Features: A Case Study
by Natalia Vanetik, Margarita Tiamanova, Genady Kogan and Marina Litvak
Information 2024, 15(6), 340; https://doi.org/10.3390/info15060340 - 7 Jun 2024
Viewed by 1122
Abstract
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

17 pages, 852 KiB  
Article
Domain-Specific Dictionary between Human and Machine Languages
by Md Saiful Islam and Fei Liu
Information 2024, 15(3), 144; https://doi.org/10.3390/info15030144 - 5 Mar 2024
Viewed by 1476
Abstract
In the realm of artificial intelligence, knowledge graphs have become an effective area of research. Relationships between entities are depicted through a structural framework in knowledge graphs. In this paper, we propose to build a domain-specific medicine dictionary (DSMD) based on the principles of knowledge graphs. Our dictionary is composed of structured triples, where each entity is defined as a concept, and these concepts are interconnected through relationships. This comprehensive dictionary boasts more than 348,000 triples, encompassing over 20,000 medicine brands and 1500 generic medicines. It presents an innovative method of storing and accessing medical data. Our dictionary facilitates various functionalities, including medicine brand information extraction, brand-specific queries, and queries involving two words or question answering. We anticipate that our dictionary will serve a broad spectrum of users, catering to both human users, such as a diverse range of healthcare professionals, and AI applications. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

27 pages, 967 KiB  
Article
Detecting Moral Features in TV Series with a Transformer Architecture through Dictionary-Based Word Embedding
by Paolo Fantozzi, Valentina Rotondi, Matteo Rizzolli, Paola Dalla Torre and Maurizio Naldi
Information 2024, 15(3), 128; https://doi.org/10.3390/info15030128 - 24 Feb 2024
Viewed by 1562
Abstract
Moral features are essential components of TV series, helping the audience to engage with the story, exploring themes beyond sheer entertainment, reflecting current social issues, and leaving a long-lasting impact on the viewers. Their presence shows through the language employed in the plot description. Their detection helps regarding understanding the series writers’ underlying message. In this paper, we propose an approach to detect moral features in TV series. We rely on the Moral Foundations Theory (MFT) framework to classify moral features and use the associated MFT dictionary to identify the words expressing those features. Our approach combines that dictionary with word embedding and similarity analysis through a deep learning SBERT (Sentence-Bidirectional Encoder Representations from Transformers) architecture to quantify the comparative prominence of moral features. We validate the approach by applying it to the definition of the MFT moral feature labels as appearing in general authoritative dictionaries. We apply our technique to the summaries of a selection of TV series representative of several genres and relate the results to the actual content of each series, showing the consistency of results. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

23 pages, 1268 KiB  
Article
Reimagining Literary Analysis: Utilizing Artificial Intelligence to Classify Modernist French Poetry
by Liu Yang, Gang Wang and Hongjun Wang
Information 2024, 15(2), 70; https://doi.org/10.3390/info15020070 - 24 Jan 2024
Cited by 1 | Viewed by 3334
Abstract
Aligned with global Sustainable Development Goals (SDGs) and multidisciplinary approaches integrating AI with sustainability, this research introduces an innovative AI framework for analyzing Modern French Poetry. It applies feature extraction techniques (TF-IDF and Doc2Vec) and machine learning algorithms (especially SVM) to create a model that objectively classifies poems by their stylistic and thematic attributes, transcending traditional subjective analyses. This work demonstrates AI’s potential in literary analysis and cultural exchange, highlighting the model’s capacity to facilitate cross-cultural understanding and enhance poetry education. The efficiency of the AI model, compared to traditional methods, shows promise in optimizing resources and reducing the environmental impact of education. Future research will refine the model’s technical aspects, ensuring effectiveness, equity, and personalization in education. Expanding the model’s scope to various poetic styles and genres will enhance its accuracy and generalizability. Additionally, efforts will focus on an equitable AI tool implementation for quality education access. This research offers insights into AI’s role in advancing poetry education and contributing to sustainability goals. By overcoming the outlined limitations and integrating the model into educational platforms, it sets a path for impactful developments in computational poetry and educational technology. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
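
For readers unfamiliar with the TF-IDF features mentioned in this abstract, the weighting can be sketched as follows. This is a minimal illustration of one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the paper's exact formulation, tokenization, and classifier settings are not given in the abstract, and the two sample lines are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count / document length) * log(N / document frequency).
    This is one common TF-IDF variant, chosen here as an assumption.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical two-line corpus, used only for illustration.
docs = ["le ciel est bleu", "le vent est froid"]
weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in every document (here "le" and "est") receive a weight of zero, so the remaining weights emphasize the vocabulary that distinguishes one text from another. Vectors of this kind are what a classifier such as an SVM would take as input.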

27 pages, 4466 KiB  
Article
Understanding Website Privacy Policies—A Longitudinal Analysis Using Natural Language Processing
by Veronika Belcheva, Tatiana Ermakova and Benjamin Fabian
Information 2023, 14(11), 622; https://doi.org/10.3390/info14110622 - 19 Nov 2023
Cited by 1 | Viewed by 3112
Abstract
Privacy policies are the main method for informing Internet users of how their data are collected and shared. This study aims to analyze the deficiencies of privacy policies in terms of readability, vague statements, and the use of pacifying phrases concerning privacy. This represents the undertaking of a step forward in the literature on this topic through a comprehensive analysis encompassing both time and website coverage. It characterizes trends across website categories, top-level domains, and popularity ranks. Furthermore, studying the development in the context of the General Data Protection Regulation (GDPR) offers insights into the impact of regulations on policy comprehensibility. The findings reveal a concerning trend: privacy policies have grown longer and more ambiguous, making it challenging for users to comprehend them. Notably, there is an increased proportion of vague statements, while clear statements have seen a decrease. Despite this, the study highlights a steady rise in the inclusion of reassuring statements aimed at alleviating readers’ privacy concerns. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

28 pages, 1207 KiB  
Article
Graph-Based Extractive Text Summarization Sentence Scoring Scheme for Big Data Applications
by Jai Prakash Verma, Shir Bhargav, Madhuri Bhavsar, Pronaya Bhattacharya, Ali Bostani, Subrata Chowdhury, Julian Webber and Abolfazl Mehbodniya
Information 2023, 14(9), 472; https://doi.org/10.3390/info14090472 - 22 Aug 2023
Cited by 2 | Viewed by 3474
Abstract
The recent advancements in big data and natural language processing (NLP) have necessitated proficient text mining (TM) schemes that can interpret and analyze voluminous textual data. Text summarization (TS) acts as an essential pillar within recommendation engines. Despite the prevalent use of abstractive techniques in TS, an anticipated shift towards a graph-based extractive TS (ETS) scheme is becoming apparent. The models, although simpler and less resource-intensive, are key in assessing reviews and feedback on products or services. Nonetheless, current methodologies have not fully resolved concerns surrounding complexity, adaptability, and computational demands. Thus, we propose our scheme, GETS, utilizing a graph-based model to forge connections among words and sentences through statistical procedures. The structure encompasses a post-processing stage that includes graph-based sentence clustering. Employing the Apache Spark framework, the scheme is designed for parallel execution, making it adaptable to real-world applications. For evaluation, we selected 500 documents from the WikiHow and Opinosis datasets, categorized them into five classes, and applied the recall-oriented understudying gisting evaluation (ROUGE) parameters for comparison with measures ROUGE-1, 2, and L. The results include recall scores of 0.3942, 0.0952, and 0.3436 for ROUGE-1, 2, and L, respectively (when using the clustered approach). Through a juxtaposition with existing models such as BERTEXT (with 3-gram, 4-gram) and MATCHSUM, our scheme has demonstrated notable improvements, substantiating its applicability and effectiveness in real-world scenarios. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
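
The ROUGE-1 and ROUGE-2 recall scores quoted in this abstract measure n-gram overlap between a generated summary and a reference summary. The sketch below shows the basic recall computation under simplifying assumptions: whitespace tokenization, lowercasing, no stemming or stop-word removal, and a single reference. It is not the evaluation code used in the paper, and the two sentences are invented for the example.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Overlap counts are clipped, so a repeated n-gram is credited at most
    as many times as it occurs in the candidate.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical reference and system summaries.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, n=1)
r2 = rouge_n_recall(candidate, reference, n=2)
print(round(r1, 4), round(r2, 4))
```

ROUGE-L, the third measure reported in the abstract, is based on the longest common subsequence rather than fixed-length n-grams and is not covered by this sketch.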

15 pages, 2511 KiB  
Article
Aspect-Based Sentiment Analysis with Dependency Relation Weighted Graph Attention
by Tingyao Jiang, Zilong Wang, Ming Yang and Cheng Li
Information 2023, 14(3), 185; https://doi.org/10.3390/info14030185 - 16 Mar 2023
Cited by 7 | Viewed by 2925
Abstract
Aspect-based sentiment analysis is a fine-grained sentiment analysis that focuses on the sentiment polarity of different aspects of text, and most current research methods use a combination of dependent syntactic analysis and graphical neural networks. In this paper, a graph attention network aspect-based sentiment analysis model based on the weighting of dependencies (WGAT) is designed to address the problem in that traditional models do not sufficiently analyse the types of syntactic dependencies; in the proposed model, graph attention networks can be weighted and averaged according to the importance of different nodes when aggregating information. The model first transforms the input text into a low-dimensional word vector through pretraining, while generating a dependency syntax graph by analysing the dependency syntax of the input text and constructing a dependency weighted adjacency matrix according to the importance of different dependencies in the graph. The word vector and the dependency weighted adjacency matrix are then fed into a graph attention network for feature extraction, and sentiment polarity is predicted through the classification layer. The model can focus on syntactic dependencies that are more important for sentiment classification during training, and the results of the comparison experiments on the Semeval-2014 laptop and restaurant datasets and the ACL-14 Twitter social comment dataset show that the WGAT model has significantly improved accuracy and F1 values compared to other baseline models, validating its effectiveness in aspect-level sentiment analysis tasks. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

25 pages, 1487 KiB  
Article
From Text Representation to Financial Market Prediction: A Literature Review
by Saeede Anbaee Farimani, Majid Vafaei Jahan and Amin Milani Fard
Information 2022, 13(10), 466; https://doi.org/10.3390/info13100466 - 29 Sep 2022
Cited by 6 | Viewed by 5296
Abstract
News dissemination in social media causes fluctuations in financial markets. (Scope) Recent advanced methods in deep learning-based natural language processing have shown promising results in financial market analysis. However, understanding how to leverage large amounts of textual data alongside financial market information is important for the investors’ behavior analysis. In this study, we review over 150 publications in the field of behavioral finance that jointly investigated natural language processing (NLP) approaches and a market data analysis for financial decision support. This work differs from other reviews by focusing on applied publications in computer science and artificial intelligence that contributed to a heterogeneous information fusion for the investors’ behavior analysis. (Goal) We study various text representation methods, sentiment analysis, and information retrieval methods from heterogeneous data sources. (Findings) We present current and future research directions in text mining and deep learning for correlation analysis, forecasting, and recommendation systems in financial markets, such as stocks, cryptocurrencies, and Forex (Foreign Exchange Market). Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

Other

25 pages, 1842 KiB  
Systematic Review
Recommender Systems Applications: Data Sources, Features, and Challenges
by Yousef H. Alfaifi
Information 2024, 15(10), 660; https://doi.org/10.3390/info15100660 - 21 Oct 2024
Viewed by 1641
Abstract
In recent years, there has been growing interest in recommendation systems, which is matched by their widespread adoption across various sectors. This can be attributed to their effectiveness in reducing an avalanche of data into individualized information that is meaningful, relevant, and can easily be absorbed by a single person. Several studies have recently navigated the landscape of recommendation systems, attending to their approaches, challenges, and applications, as well as the evaluation metrics necessary for effective implementation. This systematic review investigates the understudied aspects of recommendation systems, including the data input into the systems and their features or outputs. The data in (input) and data out (features) are both diverse and vary significantly from not just one application domain to another, but also from one application use case to another, which is a distinction that has not been thoroughly addressed in the past. In addition, this study explores several application domains, providing a comprehensive breakdown of the categorical data consumed by these systems and the features, or outputs, of these systems. Without focusing on any particular journals or their rankings, this study collects and reviews articles on recommendation systems published from 2018 to April 2024, in four top-tier research repositories, including IEEE Xplore Digital Library, Springer Link, ACM Digital Library, and Google Scholar. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
