
Natural Language Processing: Theory, Methods and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 April 2024)

Special Issue Editors


Guest Editor
Department of Computer Science, University of Vigo, ESEI-Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain
Interests: artificial intelligence; text mining; spam filtering

Guest Editor
1. CINBIO, Department of Computer Science, ESEI—Escuela Superior de Ingeniería Informática, University of Vigo, 32004 Ourense, Spain
2. SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain
Interests: text mining; artificial intelligence; image processing; machine learning; deep learning; big data

Special Issue Information

Dear Colleagues,

We are inviting submissions to this Special Issue on natural language processing.

This Special Issue is centered on the analysis of textual information in different contexts (health, e-mail classification, legal analysis, etc.). New contributions that improve the current state of the art in this field or explain possible applications are welcome. Papers may also address the application of NLP techniques to develop specific solutions that make people's daily lives easier. Contributions may concern a wide variety of techniques, including, but not limited to, solutions based on any machine-learning (ML) technique (such as traditional ML models, deep-learning techniques, or explainable artificial intelligence methodologies), word-embedding representations, the use of ontologies or ontology dictionaries, and statistical techniques.

The Special Issue is open to experimental work, properly validated solution designs, theoretical studies, and state-of-the-art review papers.

Dr. José Ramón Méndez Reboredo
Dr. David Ruano-Ordás
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • representation
  • information retrieval
  • text classification
  • semantic analysis
  • word sense disambiguation
  • clustering
  • intent detection

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (6 papers)


Research

20 pages, 1109 KiB  
Article
The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model
by Sungsoon Jang, Yeseul Cho, Hyeonmin Seong, Taejong Kim and Hosung Woo
Appl. Sci. 2024, 14(13), 5682; https://doi.org/10.3390/app14135682 - 28 Jun 2024
Abstract
Social network services and chatbots facilitate language learning without time or space constraints, but they are susceptible to personal information leakage. Accurate detection of personal information is paramount in avoiding such leaks. Conventional named entity recognizers commonly used for this purpose often fail owing to errors of non-recognition and misrecognition. Research in named entity recognition predominantly focuses on English, which poses challenges for non-English languages. By specifying procedures for the development of Korean-based tag sets, data collection, and preprocessing, we formulated directions for applying entity recognition research to non-English languages. Such research could significantly benefit artificial intelligence (AI)-based natural language processing globally. We developed a personal information tag set comprising 33 items and established guidelines for dataset creation, later converting the dataset into JSON format for AI learning. State-of-the-art AI models, BERT and ELECTRA, were employed to implement and evaluate the named entity recognition (NER) model, which achieved a 0.943 F1-score and outperformed conventional recognizers in detecting personal information. This advancement suggests that the proposed NER model can effectively prevent personal information leakage in systems processing interactive text data, marking a significant stride in safeguarding privacy across digital platforms.
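The BIO-style token classification described here maps naturally onto standard tooling. Below is a minimal sketch assuming the HuggingFace transformers API; the Korean model name, the three example tags (the paper defines 33), and the sample sentence are placeholders rather than the authors' artifacts:

```python
# Minimal sketch of a token-classification setup for personal-information NER.
# Model name and tag list are illustrative, not the paper's actual resources.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# A 33-item personal-information tag set (as in the paper) expands to
# 2 * 33 + 1 = 67 labels under the BIO scheme.
PII_TAGS = ["NAME", "PHONE", "EMAIL"]  # ... 33 categories in the paper
labels = ["O"] + [f"{p}-{t}" for t in PII_TAGS for p in ("B", "I")]

model_name = "monologg/koelectra-base-v3-discriminator"  # assumed Korean PLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# After fine-tuning on the JSON-formatted dataset, inference is a pipeline call:
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("제 이름은 홍길동이고 전화번호는 010-1234-5678입니다."))
```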

27 pages, 5204 KiB  
Article
AraFast: Developing and Evaluating a Comprehensive Modern Standard Arabic Corpus for Enhanced Natural Language Processing
by Asmaa Alrayzah, Fawaz Alsolami and Mostafa Saleh
Appl. Sci. 2024, 14(12), 5294; https://doi.org/10.3390/app14125294 - 19 Jun 2024
Abstract
This paper examines the effectiveness of AraFast, a modern standard Arabic corpus, in training transformer models for natural language processing tasks, particularly in Arabic. Four experiments were conducted to evaluate the use of AraFast across different configurations: segmented, unsegmented, and mini versions. The main outcomes are as follows. First, transformer models trained with larger and cleaner versions of AraFast, especially in question answering, demonstrate the impact of corpus quality and size on model efficacy. Second, a dramatic reduction in training loss was observed with the mini version of AraFast, underscoring the importance of optimizing corpus size for effective training. Moreover, the segmented text format led to a decrease in training loss, highlighting segmentation as a beneficial strategy in Arabic NLP. In addition, the study identifies challenges in managing noisy data derived from web sources, which were found to significantly hinder model performance. These findings collectively demonstrate the critical role of well-prepared, segmented, and clean corpora in advancing Arabic NLP capabilities. The insights from AraFast's application can guide the development of more efficient NLP models and suggest directions for future research in enhancing Arabic language processing tools.
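The noise problem flagged here is concrete enough to illustrate. The following is a hypothetical cleaning-and-segmentation pass for web-crawled Arabic text; the thresholds and rules are assumptions, not AraFast's actual pipeline:

```python
# Illustrative cleaning/segmentation pass for a web-crawled Arabic corpus.
# Heuristics are hypothetical; the paper describes AraFast's real pipeline.
import re

AR = re.compile(r"[\u0600-\u06FF]")              # Arabic Unicode block
SENT_SPLIT = re.compile(r"(?<=[.!?\u061F])\s+")  # '\u061F' is the Arabic question mark

def clean_and_segment(lines, min_len=20, min_arabic_ratio=0.6):
    seen = set()
    for line in lines:
        line = line.strip()
        if len(line) < min_len or line in seen:
            continue                              # drop short and duplicate lines
        if len(AR.findall(line)) / len(line) < min_arabic_ratio:
            continue                              # drop mostly non-Arabic noise
        seen.add(line)
        yield from SENT_SPLIT.split(line)         # segmented corpus variant

sample = ["أهلاً بالعالم. هذه جملة ثانية للتجربة على النص العربي.", "<div>nav</div>"]
print(list(clean_and_segment(sample)))
```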

13 pages, 772 KiB  
Article
A Mongolian–Chinese Neural Machine Translation Method Based on Semantic-Context Data Augmentation
by Huinuan Zhang, Yatu Ji, Nier Wu and Min Lu
Appl. Sci. 2024, 14(8), 3442; https://doi.org/10.3390/app14083442 - 19 Apr 2024
Abstract
Neural machine translation (NMT) typically relies on a substantial number of bilingual parallel corpora for effective training. Mongolian, as a low-resource language, has relatively few parallel corpora, resulting in poor translation performance. Data augmentation (DA) is a practical and promising way to address data sparsity and single semantic structure by expanding the size and structure of the available data. To address the issues of data sparsity and semantic inconsistency in Mongolian–Chinese NMT, this paper proposes a new semantic-context DA method. The method adds a semantic encoder to the original translation model, which uses both source and target sentences to generate different semantic vectors that enhance each training instance. The results show that this method significantly improves the quality of Mongolian–Chinese NMT, with a gain of approximately 2.5 BLEU points over the basic Transformer model. The method can also achieve the same translation results as the basic model with about half of the data, greatly improving translation efficiency.
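A hedged PyTorch sketch of the core idea: an auxiliary semantic encoder is mean-pooled into a sentence-level vector and fused into every encoder state. The dimensions and the concatenate-then-project fusion rule are assumptions; the paper's exact architecture may differ:

```python
# Schematic of a semantic-context augmented Transformer encoder.
import torch
import torch.nn as nn

class SemanticAugmentedEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Additional semantic encoder: a lightweight layer whose mean-pooled
        # output serves as the per-sentence semantic vector.
        self.semantic = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, src_ids):
        x = self.embed(src_ids)                            # (batch, len, d_model)
        states = self.encoder(x)
        sem = self.semantic(x).mean(dim=1, keepdim=True)   # (batch, 1, d_model)
        sem = sem.expand_as(states)                        # broadcast over positions
        return self.fuse(torch.cat([states, sem], dim=-1))

enc = SemanticAugmentedEncoder(vocab_size=32000)
print(enc(torch.randint(0, 32000, (2, 7))).shape)          # torch.Size([2, 7, 512])
```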

15 pages, 1391 KiB  
Article
Named Entity Recognition in Government Audit Texts Based on ChineseBERT and Character-Word Fusion
by Baohua Huang, Yunjie Lin, Si Pang and Long Fu
Appl. Sci. 2024, 14(4), 1425; https://doi.org/10.3390/app14041425 - 9 Feb 2024
Abstract
Named entity recognition of government audit text is a key task in intelligent auditing. To address the scarcity of corpora in the governmental auditing domain, the underutilization of word-level information in traditional character vectors, and the insufficient capture of audit entity features, this study builds its own auditing dataset and proposes CW-CBGC, a model for recognizing named entities in governmental audit text based on ChineseBERT and character-word fusion. First, the ChineseBERT pre-trained model extracts character vectors that integrate glyph and pinyin features, which are combined with word vectors dynamically constructed by the BERT pre-trained model; the sequences of character-word fusion vectors are then fed into a bidirectional gated recurrent unit (BiGRU) network to learn textual features. Finally, a conditional random field (CRF) generates the globally optimal label sequence, and the GHM classification loss function is used during training to mitigate evaluation errors caused by noisy entities and an unbalanced number of entities. The model's F1 value on the audit dataset is 97.23%, 3.64% higher than the baseline model; its F1 value on the public Resume dataset is 96.26%, 0.73–2.78% higher than mainstream models. The experimental results show that the proposed model can effectively recognize entities in government audit texts and has a degree of generalization ability.
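The tagging pipeline itself is conventional enough to sketch. Below, precomputed ChineseBERT character vectors and BERT word vectors are stood in by random tensors, the CRF comes from the third-party pytorch-crf package, and the GHM loss is omitted; treat this as a schematic, not the CW-CBGC implementation:

```python
# Skeleton of the character-word fusion + BiGRU + CRF pipeline.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class CharWordBiGRUCRF(nn.Module):
    def __init__(self, char_dim=768, word_dim=768, hidden=256, num_tags=9):
        super().__init__()  # num_tags=9 assumes, e.g., BIO over 4 entity types
        self.bigru = nn.GRU(char_dim + word_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, char_vecs, word_vecs, tags=None):
        fused = torch.cat([char_vecs, word_vecs], dim=-1)  # character-word fusion
        emissions = self.emit(self.bigru(fused)[0])
        if tags is not None:                 # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)    # inference: best tag sequence

model = CharWordBiGRUCRF()
chars = torch.randn(2, 10, 768)   # stand-in for ChineseBERT character vectors
words = torch.randn(2, 10, 768)   # stand-in for BERT word vectors
print(model(chars, words))        # decoded label indices per token
```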

12 pages, 301 KiB  
Article
An Approach to a Linked Corpus Creation for a Literary Heritage Based on the Extraction of Entities from Texts
by Kenan Kassab and Nikolay Teslya
Appl. Sci. 2024, 14(2), 585; https://doi.org/10.3390/app14020585 - 9 Jan 2024
Abstract
Working with the literary heritage of writers requires studying a large amount of material, and finding it can take considerable time even with search engines. The solution to this problem is to create a linked corpus of literary heritage. Texts in such a corpus are united by common entities, which makes it possible to select texts not only by the occurrence of query phrases but also by shared entities. To this end, we propose using a named entity recognition model trained on examples from a corpus of texts, together with a database structure for storing connections between texts. We propose automating the creation of a dataset for training a BERT-based NER model. Owing to the specifics of the subject area, methods, techniques, and strategies are proposed to increase the accuracy of a model trained on a small set of examples. As a result, we created a dataset and a model trained on it that shows high accuracy in recognizing entities in text (the average F1-score across all entity types is 0.8952). The database structure provides for the storage of unique entities and their relationships with texts, and for the selection of texts based on those entities. The method was tested on a corpus of texts from the literary heritage of Alexander Sergeevich Pushkin, which is itself a difficult task owing to the specifics of the Russian language.
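The described storage scheme amounts to a many-to-many relation between texts and unique entities. A minimal sketch with SQLite follows; the schema details and sample rows are illustrative, not the paper's:

```python
# Unique entities, texts, and a junction table linking them, so texts can be
# selected by shared entities.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE texts    (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE, type TEXT);
CREATE TABLE mentions (text_id   INTEGER REFERENCES texts(id),
                       entity_id INTEGER REFERENCES entities(id),
                       PRIMARY KEY (text_id, entity_id));
""")
con.execute("INSERT INTO texts VALUES (1, 'Eugene Onegin'), (2, 'To Chaadaev')")
con.execute("INSERT INTO entities VALUES (1, 'Saint Petersburg', 'LOC')")
con.executemany("INSERT INTO mentions VALUES (?, ?)", [(1, 1), (2, 1)])

# All texts linked through the entity 'Saint Petersburg':
rows = con.execute("""
    SELECT t.title FROM texts t
    JOIN mentions m ON m.text_id = t.id
    JOIN entities e ON e.id = m.entity_id
    WHERE e.name = 'Saint Petersburg'
""").fetchall()
print(rows)   # [('Eugene Onegin',), ('To Chaadaev',)]
```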

16 pages, 6689 KiB  
Article
The Question of Studying Information Entropy in Poetic Texts
by Olga Kozhemyakina, Vladimir Barakhnin, Natalia Shashok and Elina Kozhemyakina
Appl. Sci. 2023, 13(20), 11247; https://doi.org/10.3390/app132011247 - 13 Oct 2023
Abstract
One approach to quantitative text analysis is to represent a given text as a time series, which can then be subjected to an information entropy study for different text representations, such as "symbolic entropy", "phonetic entropy", and "emotional entropy" of various orders. Studying authors' styles through such entropic characteristics of their works is a promising direction in information analysis. In this work, entropy values of the first, second, and third order were calculated for a corpus of poems by A. S. Pushkin and other poets of the Golden Age of Russian Poetry. The values of "symbolic entropy", "phonetic entropy", and "emotional entropy", together with their mathematical expectations and variances, were calculated for the given corpora using a software application that automatically extracts statistical information, which is potentially applicable to tasks that identify features of an author's style. The extracted statistical data could form the basis of a stylometric classification of authors by entropy characteristics.
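Reading "symbolic entropy" of order n as the Shannon entropy of the character n-gram distribution, H_n = -Σ_w p(w) log2 p(w), a short sketch shows the first-, second-, and third-order computation. The paper's exact definitions (and its phonetic and emotional variants) may differ:

```python
# n-th order block entropy over character n-grams.
import math
from collections import Counter

def block_entropy(text, n):
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

verse = "I loved you once, and love, it may be, still..."
for n in (1, 2, 3):
    print(f"H_{n} = {block_entropy(verse.lower(), n):.3f} bits")
```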
