New Machine Learning and Deep Learning Techniques in Natural Language Processing

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (20 November 2024) | Viewed by 77578

Special Issue Editors


Guest Editor
Department of Computer Science, Faculty of Sciences, University of Craiova, A. I. Cuza, No.13, 200585 Craiova, Romania
Interests: machine learning; deep learning; computer vision; data mining; classification; evolutionary computation

Special Issue Information

Dear Colleagues,

Natural language processing (NLP) is a domain where machine learning has been applied intensively in recent decades. NLP long ago stopped being of interest only to enthusiasts: today, major global companies analyze sentiment in product and movie reviews, automatically extract users' opinions and discussion topics from social networks, and work to detect fake news, while many brokers seek to predict stock market trends by extracting sentiment from financial news articles. Many more NLP applications are of similarly high interest to the wider population.

Artificial intelligence has recently received an important boost from the rise of deep learning, and this fresh wave has had a clear impact on NLP applications. In most cases, new techniques based on deep learning models have surpassed traditional machine learning frameworks, although conventional approaches and new computational intelligence algorithms still lead to state-of-the-art results in some NLP applications.

The purpose of this Special Issue is to collect articles in which old and new challenges in NLP are addressed with new approaches, whether based on traditional machine learning techniques or on deep learning. Hybrid techniques, especially combinations of machine learning and metaheuristics aimed at improving accuracy, are another type of framework well suited to this collection. Moreover, computational intelligence methods and algorithms used to enhance results, e.g., evolutionary algorithms (particularly genetic algorithms), swarm intelligence, and other nature- and non-nature-inspired metaheuristics, would also fit well within the Special Issue. Regarding the type of NLP application, we encourage submissions dealing with text classification, sentiment analysis, authorship attribution, text document clustering, fake news detection, machine translation, text summarization, chatbot development, grammar checking, and voice assistants, but submissions are not limited to these examples.

We look forward to receiving your contributions.

Dr. Nebojsa Bacanin
Dr. Catalin Stoean
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • sentiment analysis
  • market intelligence
  • fake news detection
  • grammar checking
  • machine translation
  • text summarization
  • machine learning and deep learning applications
  • hybrid approaches
  • swarm intelligence and evolutionary algorithms

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (25 papers)


Research

14 pages, 1126 KiB  
Article
Adaptive Bi-Encoder Model Selection and Ensemble for Text Classification
by Youngki Park and Youhyun Shin
Mathematics 2024, 12(19), 3090; https://doi.org/10.3390/math12193090 - 2 Oct 2024
Viewed by 982
Abstract
Can bi-encoders, without additional fine-tuning, achieve a performance comparable to fine-tuned BERT models in classification tasks? To answer this question, we present a simple yet effective approach to text classification using bi-encoders without the need for fine-tuning. Our main observation is that state-of-the-art bi-encoders exhibit varying performance across different datasets. Therefore, our proposed approaches involve preparing multiple bi-encoders and, when a new dataset is provided, selecting and ensembling the most appropriate ones based on the dataset. Experimental results show that, for text classification tasks on subsets of the AG News, SMS Spam Collection, Stanford Sentiment Treebank v2, and TREC Question Classification datasets, the proposed approaches achieve performance comparable to fine-tuned BERT-Base, DistilBERT-Base, ALBERT-Base, and RoBERTa-Base. For instance, using the well-known bi-encoder model all-MiniLM-L12-v2 without additional optimization resulted in an average accuracy of 77.84%. This improved to 89.49% through the application of the proposed adaptive selection and ensemble techniques, and further increased to 91.96% when combined with the RoBERTa-Base model. We believe that this approach will be particularly useful in fields such as K-12 AI programming education, where pre-trained models are applied to small datasets without fine-tuning. Full article
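To illustrate the general idea of classifying with frozen (not fine-tuned) sentence embeddings, the sketch below assigns a label via the nearest class centroid in embedding space. The toy two-dimensional vectors and labels are purely hypothetical stand-ins for real bi-encoder outputs; this is not the paper's selection-and-ensemble method, only the underlying principle.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_centroid_classify(embedding, centroids):
    """Return the label whose class centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Toy frozen embeddings; in practice these would come from a pre-trained bi-encoder.
train = {
    "spam": [[0.9, 0.1], [0.8, 0.2]],
    "ham":  [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}
predicted = nearest_centroid_classify([0.85, 0.15], centroids)  # -> "spam"
```

Because no gradient updates are needed, swapping in a different pre-trained encoder only changes the vectors, which is what makes per-dataset model selection cheap.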

45 pages, 2062 KiB  
Article
Exploring Metaheuristic Optimized Machine Learning for Software Defect Detection on Natural Language and Classical Datasets
by Aleksandar Petrovic, Luka Jovanovic, Nebojsa Bacanin, Milos Antonijevic, Nikola Savanovic, Miodrag Zivkovic, Marina Milovanovic and Vuk Gajic
Mathematics 2024, 12(18), 2918; https://doi.org/10.3390/math12182918 - 19 Sep 2024
Cited by 2 | Viewed by 998
Abstract
Software is increasingly vital, with automated systems regulating critical functions. As development demands grow, manual code review becomes more challenging, often making testing more time-consuming than development. A promising approach to improving defect detection at the source code level is the use of artificial intelligence combined with natural language processing (NLP). Source code analysis, leveraging machine-readable instructions, is an effective method for enhancing defect detection and error prevention. This work explores source code analysis through NLP and machine learning, comparing classical and emerging error detection methods. To optimize classifier performance, metaheuristic optimizers are used, and algorithm modifications are introduced to meet the study’s specific needs. The proposed two-tier framework uses a convolutional neural network (CNN) in the first layer to handle large feature spaces, with AdaBoost and XGBoost classifiers in the second layer to improve error identification. Additional experiments using term frequency–inverse document frequency (TF-IDF) encoding in the second layer demonstrate the framework’s versatility. Across five experiments with public datasets, the accuracy of the CNN was 0.768799. The second layer, using AdaBoost and XGBoost, further improved these results to 0.772166 and 0.771044, respectively. Applying NLP techniques yielded exceptional accuracies of 0.979781 and 0.983893 from the AdaBoost and XGBoost optimizers. Full article
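For readers unfamiliar with the term frequency–inverse document frequency (TF-IDF) encoding used in the framework's second layer, a minimal stand-alone sketch is shown below. The toy corpus is illustrative only, and this is not the authors' implementation (which would typically use a library vectorizer with smoothing).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF is the raw term count normalized by document length; IDF is
    log(N / df), where df is the number of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["buffer", "overflow", "error"],
        ["null", "pointer", "error"],
        ["unit", "test", "pass"]]
w = tf_idf(docs)
```

A term such as "error" that appears in two of the three documents receives a lower weight than a term unique to one document, which is why TF-IDF highlights tokens that are distinctive of defective code comments or messages.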

21 pages, 4107 KiB  
Article
Sentiment Analysis: Predicting Product Reviews for E-Commerce Recommendations Using Deep Learning and Transformers
by Oumaima Bellar, Amine Baina and Mostafa Ballafkih
Mathematics 2024, 12(15), 2403; https://doi.org/10.3390/math12152403 - 2 Aug 2024
Viewed by 4305
Abstract
The abundance of publicly available data on the internet within the e-marketing domain is consistently expanding. A significant portion of this data revolves around consumers’ perceptions and opinions regarding the goods or services of organizations, making it valuable for market intelligence collectors in marketing, customer relationship management, and customer retention. Sentiment analysis serves as a tool for examining customer sentiment, marketing initiatives, and product appraisals. This valuable information can inform decisions related to future product and service development, marketing campaigns, and customer service enhancements. In social media, rating prediction is commonly employed to anticipate product ratings based on user reviews. Our study provides an extensive benchmark comparison of different deep learning models, including convolutional neural networks (CNN), recurrent neural networks (RNN), and bi-directional long short-term memory (Bi-LSTM). These models are evaluated using various word embedding techniques, such as bi-directional encoder representations from transformers (BERT) and its derivatives, FastText, and Word2Vec. The evaluation encompasses two setups: 5-class versus 3-class classification. This paper focuses on sentiment analysis using neural network-based models for consumer sentiment prediction, evaluating and contrasting their performance indicators on a dataset of reviews of different products from customers of an online women’s clothing retailer. Full article
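The 5-class and 3-class setups typically differ only in how star ratings are grouped into sentiment labels. One common convention is sketched below; this mapping is an assumption for illustration, not necessarily the exact grouping used in the paper.

```python
def to_three_class(rating):
    """Collapse a 1-5 star rating into negative/neutral/positive.
    A common convention: 1-2 stars negative, 3 neutral, 4-5 positive."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

labels = [to_three_class(r) for r in [1, 2, 3, 4, 5]]
```

Collapsing classes this way usually raises reported accuracy, since adjacent star ratings that models most often confuse fall into the same coarse label.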

25 pages, 1187 KiB  
Article
Switch-Transformer Sentiment Analysis Model for Arabic Dialects That Utilizes a Mixture of Experts Mechanism
by Laith H. Baniata and Sangwoo Kang
Mathematics 2024, 12(2), 242; https://doi.org/10.3390/math12020242 - 11 Jan 2024
Cited by 4 | Viewed by 1727
Abstract
In recent years, models such as the transformer have demonstrated impressive capabilities in the realm of natural language processing. However, these models are known for their complexity and the substantial training they require. Furthermore, the self-attention mechanism within the transformer, designed to capture semantic relationships among words in sequences, faces challenges when dealing with short sequences. This limitation hinders its effectiveness in five-polarity Arabic sentiment analysis (SA) tasks. The switch-transformer model has surfaced as a potential substitute. Nevertheless, when trained with one-task learning, these models frequently struggle to achieve exceptional performance and to produce resilient latent feature representations, particularly on small datasets. This challenge is particularly prominent in the case of the Arabic dialect, which is recognized as a low-resource language. In response to these constraints, this research introduces a novel method for the sentiment analysis of Arabic text. This approach leverages multi-task learning (MTL) in combination with the switch-transformer shared encoder to enhance model adaptability and refine sentence representations. By integrating a mixture of experts (MoE) technique that breaks down the problem into smaller, more manageable sub-problems, the model becomes skilled in managing extended sequences and intricate input–output relationships, thereby benefiting both five-point and three-polarity Arabic sentiment analysis tasks. The proposed model effectively identifies sentiment in Arabic dialect sentences. The empirical results underscore its exceptional performance, with accuracy rates reaching 84.02% for the HARD dataset, 67.89% for the BRAD dataset, and 83.91% for the LABR dataset. Full article
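The defining step of a switch-transformer's mixture-of-experts layer is top-1 routing: each token is scored against every expert and sent to the single highest-probability one. A minimal sketch of that routing step follows, with made-up two-dimensional router weights; a real layer would learn these weights and dispatch the token through the chosen expert's feed-forward network.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def switch_route(token_vec, router_weights):
    """Top-1 ('switch') routing: score the token against each expert's
    router vector, then send it to the single most probable expert.
    Returns (expert_index, gate_probability)."""
    logits = [sum(t * w for t, w in zip(token_vec, wv)) for wv in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

# Two toy experts; these routing weights are illustrative, not learned.
router = [[1.0, 0.0], [0.0, 1.0]]
expert, gate = switch_route([0.9, 0.1], router)  # routes to expert 0
```

Because only one expert runs per token, the layer's parameter count grows with the number of experts while the per-token compute stays roughly constant, which is what makes the problem decomposition in the abstract affordable.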

22 pages, 2161 KiB  
Article
Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding
by Xiaoyu Ji, Wanyang Hu and Yanyan Liang
Mathematics 2023, 11(24), 4895; https://doi.org/10.3390/math11244895 - 7 Dec 2023
Viewed by 1076
Abstract
The MASSIVE dataset is a spoken-language comprehension resource package for slot filling, intent classification, and virtual assistant evaluation tasks. It contains multi-language utterances from human beings communicating with a virtual assistant. In this paper, we exploited the relationship between intent classification and slot filling to improve the exact match accuracy by proposing five models with hierarchical and bidirectional architectures. There are two variants for hierarchical architectures and three variants for bidirectional architectures. These are the hierarchical concatenation model, the hierarchical attention-based model, the bidirectional max-pooling model, the bidirectional LSTM model, and the bidirectional attention-based model. The results of our models showed a significant improvement in the averaged exact match accuracy. The hierarchical attention-based model improved the accuracy by 1.01 points for the full training dataset. As for the zero-shot setup, we observed that the exact match accuracy increased from 53.43 to 53.91. In this study, we observed that, for multi-task problems, utilizing the relevance between different tasks can help in improving the model’s overall performance. Full article
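Exact match accuracy, the metric improved here, credits a prediction only when the intent and every slot label are simultaneously correct. A minimal sketch of the metric, with hypothetical intent and slot labels, is:

```python
def exact_match_accuracy(predictions, references):
    """Each item is (intent, slot_labels). A prediction counts as correct
    only if BOTH the intent and the full slot sequence match the reference."""
    hits = sum(
        1
        for (p_intent, p_slots), (r_intent, r_slots) in zip(predictions, references)
        if p_intent == r_intent and p_slots == r_slots
    )
    return hits / len(references)

# Hypothetical examples: the second prediction gets the intent right
# but makes one slot error, so it does not count.
preds = [("set_alarm", ["O", "B-time"]), ("play_music", ["B-artist", "O"])]
refs  = [("set_alarm", ["O", "B-time"]), ("play_music", ["B-artist", "I-artist"])]
acc = exact_match_accuracy(preds, refs)  # -> 0.5
```

The strictness of this metric is why modeling the dependency between the intent and slot subtasks, as the five architectures above do, pays off more here than on per-label accuracy.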

16 pages, 683 KiB  
Article
An Adaptive Mixup Hard Negative Sampling for Zero-Shot Entity Linking
by Shisen Cai, Xi Wu, Maihemuti Maimaiti, Yichang Chen, Zhixiang Wang and Jiong Zheng
Mathematics 2023, 11(20), 4366; https://doi.org/10.3390/math11204366 - 20 Oct 2023
Viewed by 1284
Abstract
Recently, the focus of entity linking research has centered on the zero-shot scenario, where the entity purposed to be labeled at the time of testing was never observed during the training phase, or it may belong to a different domain than the source domain. Current studies have used BERT as the base encoder, as it effectively establishes distributional links between source and target domains. The currently available negative sampling methods all use an extractive approach, which makes it difficult for the models to learn diverse and more challenging negative samples. To address this problem, we propose a generative negative sampling method, adaptive_mixup_hard, which generates more difficult negative entities by fusing the features of both positive and negative samples on top of hard negative sampling and introduces a transformable adaptive parameter, W, to increase the diversity of negative samples. Next, we fuse our method with the Biencoder architecture and evaluate its performance under three different score functions. Ultimately, experimental results on the standard benchmark dataset, Zeshel, demonstrate the effectiveness of our method. Full article
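The core mixup idea, synthesizing harder negatives by interpolating positive and negative entity embeddings, can be sketched as below. This is a simplified illustration: in the paper the adaptive parameter W is transformable rather than a fixed scalar, and the mixing happens on top of hard negative sampling.

```python
def mixup_negative(positive, negative, w):
    """Interpolate a positive and a negative entity embedding to synthesize
    a harder negative: the closer w is to 1, the nearer the synthetic
    negative sits to the positive, making it harder to discriminate."""
    return [w * p + (1 - w) * n for p, n in zip(positive, negative)]

pos = [1.0, 0.0]   # toy positive-entity embedding
neg = [0.0, 1.0]   # toy sampled negative embedding
hard = mixup_negative(pos, neg, 0.7)  # approximately [0.7, 0.3]
```

Varying w across training steps yields a spectrum of negatives of differing difficulty, which is the diversity benefit claimed in the abstract.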

19 pages, 731 KiB  
Article
An Approach Based on Cross-Attention Mechanism and Label-Enhancement Algorithm for Legal Judgment Prediction
by Junyi Chen, Xuanqing Zhang, Xiabing Zhou, Yingjie Han and Qinglei Zhou
Mathematics 2023, 11(9), 2032; https://doi.org/10.3390/math11092032 - 25 Apr 2023
Cited by 3 | Viewed by 1889
Abstract
Legal Judgment Prediction aims to automatically predict judgment outcomes based on descriptions of legal cases and established law articles, and has received increasing attention. In prior work, several problems have not been adequately solved. One is how to utilize limited but valuable label information: existing methods mostly ignore the gap between the descriptions of established articles and cases and directly integrate them. Second, most studies ignore the mutual constraints among the subtasks: logically or semantically, each charge is related only to some specific articles. To address these issues, we first construct a crime similarity graph and then perform a distillation operation to collect discriminative keywords for each charge. Furthermore, we fuse these discriminative keywords, instead of established article descriptions, into the case embedding with a cross-attention mechanism to obtain deep semantic representations of cases incorporating label information. Finally, under a constraint among subtasks, we optimize the one-hot representation of ground-truth labels to guarantee consistent results across the subtasks based on the label-enhancement algorithm. To verify the effectiveness and robustness of our framework, we conduct extensive experiments on two public datasets. The experimental results show that the proposed method outperforms state-of-the-art models by 3.89%/7.92% and 1.23%/2.50% in the average MF1-score of the subtasks on CAIL-Small/Big, respectively. Full article

21 pages, 3970 KiB  
Article
Web-Informed-Augmented Fake News Detection Model Using Stacked Layers of Convolutional Neural Network and Deep Autoencoder
by Abdullah Marish Ali, Fuad A. Ghaleb, Mohammed Sultan Mohammed, Fawaz Jaber Alsolami and Asif Irshad Khan
Mathematics 2023, 11(9), 1992; https://doi.org/10.3390/math11091992 - 23 Apr 2023
Cited by 6 | Viewed by 3219
Abstract
Today, fake news is a growing concern due to its devastating impacts on communities. The rise of social media, which many users consider the main source of news, has exacerbated this issue because individuals can disseminate fake news quickly and inexpensively, with fewer checks and filters than traditional news media. Numerous approaches have been explored to automate the detection of fake news and prevent its spread. However, achieving accurate detection requires addressing two crucial aspects: obtaining representative features of the news and designing an appropriate model. Most existing solutions rely solely on content-based features, which are insufficient and overlapping. Moreover, most of the models used for classification are constructed around a dense feature vector unsuitable for short news sentences. To address this problem, this study proposes a Web-Informed-Augmented Fake News Detection Model using Stacked Layers of Convolutional Neural Network and Deep Autoencoder, called ICNN-AEN-DM. The augmented information is gathered from web searches of trusted sources to either support or reject the claims in the news content. Stacked CNN layers with a deep autoencoder are then used to train a probabilistic deep learning-based classifier. The probabilistic outputs of the stacked layers are used to train a decision-making stage by stacking multilayer perceptron (MLP) layers on top of the probabilistic deep learning layers. The results of extensive experiments on challenging datasets show that the proposed model performs better than related models, achieving a 26.6% improvement in detection accuracy and an 8% improvement in overall detection performance. Such achievements are promising for reducing the negative impacts of fake news on communities. Full article

20 pages, 669 KiB  
Article
A Concise Relation Extraction Method Based on the Fusion of Sequential and Structural Features Using ERNIE
by Yu Wang, Yuan Wang, Zhenwan Peng, Feifan Zhang and Fei Yang
Mathematics 2023, 11(6), 1439; https://doi.org/10.3390/math11061439 - 16 Mar 2023
Cited by 2 | Viewed by 2119
Abstract
Relation extraction, a fundamental task in natural language processing, aims to extract entity triples from unstructured data. These triples can then be used to build a knowledge graph. Recently, pre-training models that have learned prior semantic and syntactic knowledge, such as BERT and ERNIE, have enhanced the performance of relation extraction tasks. However, previous research has mainly focused on sequential or structural data alone, such as the shortest dependency path, ignoring the fact that fusing sequential and structural features may improve the classification performance. This study proposes a concise approach using the fused features for the relation extraction task. Firstly, for the sequential data, we verify in detail which of the generated representations can effectively improve the performance. Secondly, inspired by the pre-training task of next-sentence prediction, we propose a concise relation extraction approach based on the fusion of sequential and structural features using the pre-training model ERNIE. The experiments were conducted on the SemEval 2010 Task 8 dataset and the results show that the proposed method can improve the F1 value to 0.902. Full article

29 pages, 416 KiB  
Article
It’s All in the Embedding! Fake News Detection Using Document Embeddings
by Ciprian-Octavian Truică and Elena-Simona Apostol
Mathematics 2023, 11(3), 508; https://doi.org/10.3390/math11030508 - 18 Jan 2023
Cited by 32 | Viewed by 6362
Abstract
With the current shift in the mass media landscape from journalistic rigor to social media, personalized social media is becoming the new norm. Although the digitalization of the media brings many advantages, it also increases the risk of spreading disinformation, misinformation, and mal-information through fake news. The emergence of this harmful phenomenon has managed to polarize society and manipulate public opinion on particular topics, e.g., elections, vaccinations, etc. Such information propagated on social media can distort public perceptions and generate social unrest while lacking the rigor of traditional journalism. Natural Language Processing and Machine Learning techniques are essential for developing efficient tools that can detect fake news. Models that use the context of textual data are essential for resolving the fake news detection problem, as they manage to encode linguistic features within the vector representation of words. In this paper, we propose a new approach that uses document embeddings to build multiple models that accurately label news articles as reliable or fake. We also present a benchmark of different architectures that detect fake news using binary or multi-labeled classification. We evaluated the models on five large news corpora using accuracy, precision, and recall. We obtained better results than more complex state-of-the-art Deep Neural Network models. We observe that the most important factor for obtaining high accuracy is the document encoding, not the classification model's complexity. Full article
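One simple way to obtain a document embedding is to average pre-trained word vectors over the document's tokens. The sketch below uses toy two-dimensional vectors and illustrates only this general idea, not the specific embedding models benchmarked in the paper.

```python
def document_embedding(tokens, word_vectors):
    """Average the available word vectors into one fixed-size document vector;
    tokens without a known vector are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy vocabulary; real systems would load Word2Vec/FastText/transformer vectors.
word_vectors = {"fake": [1.0, 0.0], "news": [0.0, 1.0]}
doc = document_embedding(["fake", "news", "today"], word_vectors)  # -> [0.5, 0.5]
```

The resulting fixed-size vector can feed any off-the-shelf classifier, which is consistent with the paper's finding that the encoding matters more than the classifier's complexity.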

20 pages, 1100 KiB  
Article
Contextual Urdu Lemmatization Using Recurrent Neural Network Models
by Rabab Hafeez, Muhammad Waqas Anwar, Muhammad Hasan Jamal, Tayyaba Fatima, Julio César Martínez Espinosa, Luis Alonso Dzul López, Ernesto Bautista Thompson and Imran Ashraf
Mathematics 2023, 11(2), 435; https://doi.org/10.3390/math11020435 - 13 Jan 2023
Cited by 4 | Viewed by 5034
Abstract
In the field of natural language processing, machine translation is a rapidly developing research area that helps humans communicate more effectively by bridging the linguistic gap. In machine translation, normalization and morphological analyses are the first and perhaps the most important modules for information retrieval (IR). To build a morphological analyzer, or to complete the normalization process, it is important to extract the correct root out of different words. Stemming and lemmatization are techniques commonly used to find the correct root words in a language. A few studies on IR systems for the Urdu language have shown that lemmatization is more effective than stemming due to infixes found in Urdu words. However, lemmatization techniques for resource-scarce languages such as Urdu are not very common. This paper presents a lemmatization algorithm based on recurrent neural network models for the Urdu language. The proposed model is trained and tested on two datasets, namely, the Urdu Monolingual Corpus (UMC) and the Universal Dependencies Corpus of Urdu (UDU). The datasets are lemmatized with the help of recurrent neural network models. The Word2Vec model and edit trees are used to generate semantic and syntactic embeddings. Bidirectional long short-term memory (BiLSTM), bidirectional gated recurrent unit (BiGRU), bidirectional gated recurrent neural network (BiGRNN), and attention-free encoder–decoder (AFED) models are trained under defined hyperparameters. Experimental results show that the attention-free encoder–decoder model achieves an accuracy, precision, recall, and F-score of 0.96, 0.95, 0.95, and 0.95, respectively, and outperforms existing models. Full article

23 pages, 611 KiB  
Article
Accuracy Analysis of the End-to-End Extraction of Related Named Entities from Russian Drug Review Texts by Modern Approaches Validated on English Biomedical Corpora
by Alexander Sboev, Roman Rybka, Anton Selivanov, Ivan Moloshnikov, Artem Gryaznov, Alexander Naumov, Sanna Sboeva, Gleb Rylkov and Soyora Zakirova
Mathematics 2023, 11(2), 354; https://doi.org/10.3390/math11020354 - 9 Jan 2023
Cited by 3 | Viewed by 1879
Abstract
The extraction of significant information from Internet sources is an important task of pharmacovigilance due to the need for post-clinical drug monitoring. This research considers the task of end-to-end recognition of pharmaceutically significant named entities and their relations in natural language texts. “End-to-end” means that both tasks are performed within a single process on the “raw” text without annotation. The study is based on the current version of the Russian Drug Review Corpus—a dataset of 3800 review texts from the Russian segment of the Internet. Currently, this is the only corpus in the Russian language appropriate for research of this type. We estimated the accuracy of the recognition of pharmaceutically significant entities and their relations using two approaches based on neural-network language models. The first approach solves the tasks of named-entity recognition and relation extraction sequentially (the sequential approach). The second solves both tasks simultaneously with a single neural network (the joint approach). The study includes a comparison of both approaches, along with hyperparameter selection to maximize the resulting accuracy. It is shown that both approaches solve the target task at the same level of accuracy: a 52–53% macro-averaged F1-score, which is the current level of accuracy for end-to-end tasks in the Russian language. Additionally, the paper presents the results for the English open datasets ADE and DDI based on the joint approach, together with hyperparameter selection for modern domain-specific language models. The achieved accuracies of 84.2% (ADE) and 73.3% (DDI) are comparable to or better than other published results for these datasets. Full article
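The macro-averaged F1-score reported above weights every entity class equally regardless of how often it occurs, so rare pharmaceutical entities count as much as common ones. A minimal sketch of the metric, with hypothetical entity labels, is:

```python
def macro_f1(predictions, references):
    """Per-class F1 scores averaged with equal class weight."""
    labels = set(references) | set(predictions)
    scores = []
    for label in labels:
        tp = sum(1 for p, r in zip(predictions, references) if p == label and r == label)
        fp = sum(1 for p, r in zip(predictions, references) if p == label and r != label)
        fn = sum(1 for p, r in zip(predictions, references) if p != label and r == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical token-level labels: DRUG and ADR (adverse drug reaction).
preds = ["DRUG", "DRUG", "ADR", "O"]
refs  = ["DRUG", "ADR",  "ADR", "O"]
score = macro_f1(preds, refs)
```

Here the per-class F1 values are 2/3 (DRUG), 2/3 (ADR), and 1.0 (O), so the macro average is 7/9, even though 3 of 4 individual labels are correct.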

15 pages, 594 KiB  
Article
Robust Data Augmentation for Neural Machine Translation through EVALNET
by Yo-Han Park, Yong-Seok Choi, Seung Yun, Sang-Hun Kim and Kong-Joo Lee
Mathematics 2023, 11(1), 123; https://doi.org/10.3390/math11010123 - 27 Dec 2022
Cited by 3 | Viewed by 3110
Abstract
Since building Neural Machine Translation (NMT) systems requires a large parallel corpus, various data augmentation techniques have been adopted, especially for low-resource languages. To achieve the best performance through data augmentation, an NMT system should be able to evaluate the quality of the augmented data. Several studies have addressed data-weighting techniques to assess data quality. The basic idea of data weighting adopted in previous studies is the loss value that a system calculates when learning from training data. The weight derived from the loss value of the data, through simple heuristic rules or neural models, can adjust the loss used in the next step of the learning process. In this study, we propose EvalNet, a data evaluation network, to assess parallel data for NMT. EvalNet exploits a loss value, a cross-attention map, and the semantic similarity between parallel sentences as its features. The cross-attention map is an encoded representation of the cross-attention layers of the Transformer, the base architecture of an NMT system. The semantic similarity is the cosine similarity between the semantic embeddings of a source sentence and a target sentence. Owing to the parallel nature of the data, the combination of the cross-attention map and the semantic similarity proved to be effective features for data quality evaluation, alongside the loss value. EvalNet is the first NMT data-evaluator network to introduce the cross-attention map and semantic similarity as features. Through various experiments, we conclude that EvalNet is simple yet beneficial for robust training of an NMT system and outperforms previous approaches as a data evaluator. Full article
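The semantic-similarity feature described here is the cosine between the source- and target-sentence embeddings. A minimal sketch, assuming the embedding vectors themselves are produced elsewhere (e.g., by a sentence encoder):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (||u|| * ||v||); 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

For a well-aligned sentence pair the two embeddings point in nearly the same direction, so a low cosine flags a likely noisy augmentation.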

31 pages, 6561 KiB  
Article
Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering
by Nebojsa Bacanin, Miodrag Zivkovic, Catalin Stoean, Milos Antonijevic, Stefana Janicijevic, Marko Sarac and Ivana Strumberger
Mathematics 2022, 10(22), 4173; https://doi.org/10.3390/math10224173 - 8 Nov 2022
Cited by 41 | Viewed by 5607
Abstract
Spam represents a genuine irritation for email users, since it often disturbs them during their work or free time. Machine learning approaches are commonly utilized as the engine of spam detection solutions, as they are efficient and usually exhibit a high degree of classification accuracy. Nevertheless, it sometimes happens that good messages are labeled as spam and, more often, that spam emails reach the inbox as good ones. This manuscript proposes a novel email spam detection approach that combines machine learning models with an enhanced sine cosine swarm intelligence algorithm to counter the deficiencies of existing techniques. The introduced enhanced sine cosine algorithm was adopted for training logistic regression and for tuning XGBoost models as part of the hybrid machine learning-metaheuristics framework. The developed framework was validated on two public high-dimensional spam benchmark datasets (CSDMC2010 and TurkishEmail), and the extensive experiments conducted show that the model successfully deals with high-dimensional data. The comparative analysis with other cutting-edge spam detection models, also based on metaheuristics, shows that the proposed hybrid method obtains superior performance in terms of accuracy, precision, recall, F1-score, and other relevant classification metrics. Additionally, the empirically established superiority of the proposed method is validated using rigorous statistical tests. Full article
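The paper's enhancements to the sine cosine algorithm are its own contribution; for orientation, the basic sine cosine position update (following Mirjalili's original formulation) can be sketched as:

```python
import math
import random

def sca_step(position, best, t, max_iter, a=2.0):
    # One basic sine cosine algorithm update of a candidate solution toward
    # the best-known solution (the paper's enhancements are omitted here).
    r1 = a - t * (a / max_iter)  # decays linearly: exploration -> exploitation
    new_pos = []
    for x, p in zip(position, best):
        r2 = random.uniform(0, 2 * math.pi)  # how far toward/away from best
        r3 = random.uniform(0, 2)            # random weight on the best solution
        r4 = random.random()                 # chooses the sine or cosine branch
        if r4 < 0.5:
            x = x + r1 * math.sin(r2) * abs(r3 * p - x)
        else:
            x = x + r1 * math.cos(r2) * abs(r3 * p - x)
        new_pos.append(x)
    return new_pos
```

Since `r1` shrinks to zero over the run, late iterations make only small moves around the best solution, which is what lets the same update rule both explore early and fine-tune late.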

15 pages, 279 KiB  
Article
An Approach Based on Semantic Relationship Embeddings for Text Classification
by Ana Laura Lezama-Sánchez, Mireya Tovar Vidal and José A. Reyes-Ortiz
Mathematics 2022, 10(21), 4161; https://doi.org/10.3390/math10214161 - 7 Nov 2022
Cited by 4 | Viewed by 2948
Abstract
Semantic relationships between words provide relevant information about the overall meaning of a text. Existing embedding representation models characterize each word as a fixed-length vector of numbers. These models have been used in text classification tasks such as recommendation and question-answering systems. However, the information provided by semantic relationships has been neglected in such embeddings. Therefore, this paper proposes and evaluates an approach that incorporates semantic relationships into embedding models for text classification. Three embedding models based on semantic relations extracted from Wikipedia are presented and compared with existing word-based models. Our approach considers the following relationships: synonymy, hyponymy, and hyperonymy. They were chosen because previous experiments have shown that they provide semantic knowledge. The relationships are extracted from Wikipedia using lexical-syntactic patterns identified in the literature. The extracted relationships are embedded as vectors: synonymy, hyponymy–hyperonymy, and a combination of all relationships. A Convolutional Neural Network using the semantic relationship embeddings was trained for text classification. The proposed relationship embedding configurations and existing word-based models were evaluated and compared on two corpora. Results are reported with the metrics of precision, accuracy, recall, and F1-measure. The best results for the 20 Newsgroups corpus were obtained with the hyponymy–hyperonymy embeddings, achieving an accuracy of 0.79. For the Reuters corpus, an F1-measure and recall of 0.87 were obtained using synonymy–hyponymy–hyperonymy. Full article
14 pages, 2972 KiB  
Article
Pseudocode Generation from Source Code Using the BART Model
by Anas Alokla, Walaa Gad, Waleed Nazih, Mustafa Aref and Abdel-Badeeh Salem
Mathematics 2022, 10(21), 3967; https://doi.org/10.3390/math10213967 - 25 Oct 2022
Cited by 6 | Viewed by 4643
Abstract
In the software development process, more than one developer may work on the same program, and bugs may be fixed by a different developer; therefore, understanding the source code is an important issue. Pseudocode plays an important role in solving this problem, as it helps developers understand the source code. Recently, transformer-based pre-trained models have achieved remarkable results in machine translation, which is similar to pseudocode generation. In this paper, we propose novel automatic pseudocode generation from source code based on a pre-trained Bidirectional and Auto-Regressive Transformer (BART) model. We fine-tuned two pre-trained BART models (large and base) using a dataset containing source code and its equivalent pseudocode. Two benchmark datasets (Django and SPoC) were used to evaluate the proposed model. The model based on BART large outperforms other state-of-the-art models in terms of the BLEU measure by 15% and 27% for the Django and SPoC datasets, respectively. Full article

25 pages, 3826 KiB  
Article
A Reverse Positional Encoding Multi-Head Attention-Based Neural Machine Translation Model for Arabic Dialects
by Laith H. Baniata, Sangwoo Kang and Isaac K. E. Ampomah
Mathematics 2022, 10(19), 3666; https://doi.org/10.3390/math10193666 - 6 Oct 2022
Cited by 11 | Viewed by 2724
Abstract
Languages with free word order, such as Arabic dialects, are considered a challenge for neural machine translation (NMT) models because of their attached suffixes, affixes, and out-of-vocabulary words. This paper presents a new reverse positional encoding mechanism for a multi-head attention (MHA) NMT model that translates right-to-left texts such as Arabic dialects (ADs) into Modern Standard Arabic (MSA). The proposed model builds on a recently suggested MHA mechanism. The new reverse positional encoding (RPE) mechanism, together with the use of sub-word units as input to the self-attention layer, improves the encoder's self-attention sublayer by capturing all dependencies between the words in right-to-left texts such as AD input sentences. Experiments were conducted on Maghrebi Arabic to MSA, Levantine Arabic to MSA, Nile Basin Arabic to MSA, Gulf Arabic to MSA, and Iraqi Arabic to MSA. The experimental analysis showed that the proposed RPE MHA NMT model efficiently handles the free grammatical structure of Arabic dialect sentences and enhances translation quality for right-to-left texts such as Arabic dialects. Full article
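The exact RPE formulation is given in the paper; as an illustration of the idea, standard sinusoidal positional encodings can be indexed from the end of the sequence, so the last token receives position 0, mirroring right-to-left reading order:

```python
import math

def reverse_positional_encoding(seq_len, d_model):
    # Standard sinusoidal encodings, but with positions counted from the
    # right end of the sequence; an illustrative sketch only, not the
    # paper's exact RPE formulation.
    table = []
    for pos in range(seq_len):
        rev = seq_len - 1 - pos  # position index counted from the right
        row = []
        for i in range(d_model):
            angle = rev / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Reversing the index means relative distances near the sentence-final (right-edge) material are encoded the same way regardless of sentence length, which is the intuition behind adapting positional encoding to right-to-left scripts.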

17 pages, 2253 KiB  
Article
Synthetic Data Generator for Solving Korean Arithmetic Word Problem
by Kangmin Kim and Chanjun Chun
Mathematics 2022, 10(19), 3525; https://doi.org/10.3390/math10193525 - 27 Sep 2022
Cited by 1 | Viewed by 2229
Abstract
A math word problem (MWP) comprises mathematical logic, numbers, and natural language. To solve such problems, a solver model requires both language understanding and the ability to reason. Since the 1960s, research on models that provide automatic solutions to mathematical problems has been conducted continuously, and numerous methods and datasets have been published. However, published datasets in Korean are insufficient. In this study, we propose the first Korean data generator to address this issue. The proposed generator covers 4 problem types with 42 subtypes and applies four categories of data variation, which adds robustness to the trained model. In total, 210,311 data points were used for the experiment, of which 210,000 were generated: 150,000 for training and 30,000 each for validation and testing. The remaining 311 problems were sourced from commercially available books of mathematical problems and were used to evaluate the validity of our data generator on real math word problems. The experiments confirm that models developed using the proposed data generator can be applied to real data. The proposed generator can be used to solve Korean MWPs in education and the service industry, and can serve as a basis for future research in this field. Full article
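Template-based generation of the kind described pairs a natural-language pattern with a program that computes the answer. A sketch of the idea; the English templates, name, and number ranges below are illustrative stand-ins for the paper's Korean problem types:

```python
import random

# Each template pairs a problem pattern with the function that solves it.
TEMPLATES = [
    ("{name} has {a} apples and buys {b} more. How many apples does {name} have?",
     lambda a, b: a + b),
    ("{name} has {a} apples and gives away {b}. How many apples remain?",
     lambda a, b: a - b),
]

def generate(n, seed=0):
    # Produce n (problem text, gold answer) pairs by filling templates
    # with random values; ranges keep subtraction results non-negative.
    random.seed(seed)
    data = []
    for _ in range(n):
        text, solve = random.choice(TEMPLATES)
        a, b = random.randint(2, 9), random.randint(1, 2)
        data.append((text.format(name="Minsu", a=a, b=b), solve(a, b)))
    return data
```

Because the answer is computed by the same function that fills the template, every generated problem comes with a guaranteed-correct label, which is what makes synthetic corpora of this size practical.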

20 pages, 1212 KiB  
Article
Identification of Review Helpfulness Using Novel Textual and Language-Context Features
by Muhammad Shehrayar Khan, Atif Rizwan, Muhammad Shahzad Faisal, Tahir Ahmad, Muhammad Saleem Khan and Ghada Atteia
Mathematics 2022, 10(18), 3260; https://doi.org/10.3390/math10183260 - 7 Sep 2022
Cited by 1 | Viewed by 2724
Abstract
With the increase in users of social media websites such as IMDb, a movie website, and the rise of publicly available data, opinion mining is more accessible than ever. In the field of language understanding, categorization of movie reviews can be challenging because human language is complex, leading to scenarios where connotation words exist. Connotation words have a meaning different from their literal one, and the context in which a word is used changes its semantics. In this research work, categorizing movie reviews with good F-Measure scores is investigated with Word2Vec, and three different groups of proposed features are inspected. First, psychological features are extracted from reviews: positive emotion, negative emotion, anger, sadness, clout (confidence level), and dictionary words. Second, readability features are extracted: the Automated Readability Index (ARI), the Coleman–Liau Index (CLI), and Word Count (WC) are calculated to measure each review's understandability, and their impact on classification performance is measured. Lastly, linguistic features are extracted from reviews: adjectives and adverbs. The Word2Vec model is trained on a collection of 50,000 movie reviews. The self-trained Word2Vec model is used for the contextualized embedding of words into vectors with 50, 100, 150, and 300 dimensions; the pretrained Word2Vec model converts words into vectors with 150 and 300 dimensions. Traditional and advanced machine-learning (ML) algorithms are applied and evaluated according to the performance measures of accuracy, precision, recall, and F-Measure. The results indicate that a Support Vector Machine (SVM) using self-trained Word2Vec achieved an 86% F-Measure, and that concatenating the psychological, linguistic, and readability features with the Word2Vec features raised the SVM F-Measure to 87.93%. Full article
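The ARI mentioned above is a standard readability formula based on characters per word and words per sentence: ARI = 4.71 · (characters/words) + 0.5 · (words/sentences) − 21.43. A rough sketch (the tokenization here is deliberately naive):

```python
def automated_readability_index(text):
    # ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    # Whitespace tokenization and punctuation stripping are simplifications.
    words = text.split()
    chars = sum(len(w.strip(".,!?;:")) for w in words)
    sentences = max(1, sum(text.count(c) for c in ".!?"))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```

Higher values correspond to text that demands a higher (US grade-level) reading ability, so a low ARI marks a review as easy to understand.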

17 pages, 810 KiB  
Article
Development of a Multilingual Model for Machine Sentiment Analysis in the Serbian Language
by Drazen Draskovic, Darinka Zecevic and Bosko Nikolic
Mathematics 2022, 10(18), 3236; https://doi.org/10.3390/math10183236 - 6 Sep 2022
Cited by 8 | Viewed by 2613
Abstract
In this research, a method of developing a machine model for sentiment processing in the Serbian language is presented. The Serbian language, unlike English and other popular languages, belongs to the group of languages with limited resources. Three different datasets were used as data sources: a balanced set of music album reviews, a balanced set of movie reviews, and a balanced set of music album reviews in English (MARD), which was translated into Serbian. The evaluation included applying the developed models with three standard classification algorithms (naive Bayes, logistic regression, and support vector machine) and applying a hybrid model, which produced the best results. The models were trained on each of the three datasets, while a set of music reviews originally written in Serbian was used for testing. By comparing the results of the developed models, the possibility of expanding the dataset for developing the machine model was also evaluated. Full article

23 pages, 2096 KiB  
Article
An Entity-Matching System Based on Multimodal Data for Two Major E-Commerce Stores in Mexico
by Raúl Estrada-Valenciano, Víctor Muñiz-Sánchez and Héctor De-la-Torre-Gutiérrez
Mathematics 2022, 10(15), 2564; https://doi.org/10.3390/math10152564 - 23 Jul 2022
Cited by 1 | Viewed by 2277
Abstract
E-commerce has grown considerably in Latin America in recent years due to the COVID-19 pandemic. E-commerce users in English-speaking and Chinese-speaking countries have web-based tools to compare the prices of products offered by various retailers. The task of product comparison is known as entity matching in the data-science domain. This paper proposes the first entity-matching system for product comparison in Spanish-speaking e-commerce. Given the lack of uniformity of e-commerce sites in Mexico, we opted for a bimodal entity-matching system that uses the image and textual description of products from two of the largest e-commerce stores in Mexico. State-of-the-art techniques in natural language processing and machine learning were used to develop this research. The resulting system achieves F1 values of approximately 80%, representing a significant step towards consolidating a product-matching system in Spanish-speaking e-commerce. Full article

13 pages, 1971 KiB  
Article
Short Answer Detection for Open Questions: A Sequence Labeling Approach with Deep Learning Models
by Samuel González-López, Zeltzyn Guadalupe Montes-Rosales, Adrián Pastor López-Monroy, Aurelio López-López and Jesús Miguel García-Gorrostieta
Mathematics 2022, 10(13), 2259; https://doi.org/10.3390/math10132259 - 28 Jun 2022
Cited by 1 | Viewed by 1970
Abstract
Evaluating the response to open questions is a complex process, since it requires prior knowledge of a specific topic and language. The computational challenge is to analyze the text by learning from a set of correct examples to train a model and then predicting unseen cases, thereby capturing the patterns that characterize answers to open questions. In this work, we used a sequence labeling and deep learning approach to detect whether a text segment corresponds to the answer to an open question. We focused our efforts on analyzing the general objective of a thesis according to three methodological questions: Q1: What will be done? Q2: Why is it going to be done? Q3: How is it going to be done? First, we used the Beginning-Inside-Outside (BIO) format to label a corpus of objectives with the help of two annotators. Subsequently, we adapted four state-of-the-art architectures to analyze the objectives: Bidirectional Encoder Representations from Transformers for Spanish (BERT-BETO), Code Switching Embeddings from Language Model (CS-ELMo), Multitask Neural Network (MTNN), and Bidirectional Long Short-Term Memory (Bi-LSTM). The F-measure results for detecting the answers to the three questions indicate that BERT-BETO and CS-ELMo were the most effective architectures, with BERT-BETO obtaining the best and most accurate results overall. A detection analysis for Q1, Q2, and Q3 on a non-annotated corpus at the graduate and undergraduate levels is also reported. We found that only the doctoral academic level reached 100% for all three questions; that is, only the doctoral objectives consistently contained the answers to all three questions. Full article
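BIO labeling marks the first token of an answer span with a B- tag, its continuation with I-, and everything outside the span with O. A small illustration (the `ANS` tag name and the example span are hypothetical, not the paper's tag set):

```python
def bio_tags(tokens, answer_span):
    # Label tokens of an objective: B-ANS on the first token of the answer
    # span, I-ANS on the rest of the span, O everywhere else. The span
    # indices here are supplied by hand rather than by annotators.
    start, end = answer_span
    tags = []
    for i, _ in enumerate(tokens):
        if i == start:
            tags.append("B-ANS")
        elif start < i <= end:
            tags.append("I-ANS")
        else:
            tags.append("O")
    return tags
```

Casting answer detection as per-token tagging is what lets sequence models such as Bi-LSTMs or BERT-style encoders be trained on it directly.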

16 pages, 12417 KiB  
Article
Retrieval-Based Transformer Pseudocode Generation
by Anas Alokla, Walaa Gad, Waleed Nazih, Mustafa Aref and Abdel-Badeeh Salem
Mathematics 2022, 10(4), 604; https://doi.org/10.3390/math10040604 - 16 Feb 2022
Cited by 9 | Viewed by 3602
Abstract
Comprehending source code is very difficult, especially if the programmer is not familiar with the programming language. Pseudocode explains and describes code contents based on semantic analysis and understanding of the source code. In this paper, a novel retrieval-based transformer pseudocode generation model is proposed. The proposed model combines different retrieval similarity methods with neural machine translation to generate pseudocode, and it handles words of low frequency and words that do not exist in the training dataset. It consists of three steps. First, the sentences most similar to the input sentence are retrieved using different similarity methods. Second, the retrieved source code is passed to a transformer-based deep learning model to generate the corresponding retrieved pseudocode. Third, a replacement process is performed to obtain the target pseudocode. The proposed model is evaluated on the Django and SPoC datasets. The experiments show promising results compared to other machine translation language models, reaching 61.96 and 50.28 BLEU for Django and SPoC, respectively. Full article
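The first, retrieval step selects the training sentence most similar to the input. As a stand-in for the similarity methods compared in the paper, a sketch using Jaccard overlap on whitespace tokens:

```python
def jaccard(a, b):
    # Token-set overlap: |A ∩ B| / |A ∪ B|, in [0, 1].
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def retrieve_most_similar(query, corpus):
    # Return the training sentence with the highest similarity to the query;
    # Jaccard overlap is an illustrative substitute for the paper's methods.
    return max(corpus, key=lambda s: jaccard(query, s))
```

The retrieved sentence's known pseudocode then anchors generation, which is how the approach copes with rare and out-of-vocabulary tokens.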

30 pages, 3430 KiB  
Article
ANA: Ant Nesting Algorithm for Optimizing Real-World Problems
by Deeam Najmadeen Hama Rashid, Tarik A. Rashid and Seyedali Mirjalili
Mathematics 2021, 9(23), 3111; https://doi.org/10.3390/math9233111 - 2 Dec 2021
Cited by 20 | Viewed by 3791
Abstract
In this paper, a novel swarm intelligence algorithm called the ant nesting algorithm (ANA) is proposed. The algorithm is inspired by Leptothorax ants and mimics the behavior of ants searching for positions to deposit grains while building a new nest. Although the algorithm is inspired by the swarming behavior of ants, it has no algorithmic similarity with the ant colony optimization (ACO) algorithm. ANA is a continuous algorithm that updates the search-agent position by adding a rate of change (e.g., a step or velocity). ANA computes this rate of change differently: it uses previous and current solutions and their fitness values during the optimization process to generate weights by utilizing the Pythagorean theorem, and these weights drive the search agents during the exploration and exploitation phases. The ANA algorithm is benchmarked on 26 well-known test functions, and the results are verified by a comparative study with the genetic algorithm (GA), particle swarm optimization (PSO), the dragonfly algorithm (DA), five modified versions of PSO, the whale optimization algorithm (WOA), the salp swarm algorithm (SSA), and the fitness dependent optimizer (FDO). ANA outperforms these prominent metaheuristic algorithms on several test cases and provides quite competitive results. Finally, the algorithm is employed to optimize two well-known real-world engineering problems: antenna array design and frequency-modulated synthesis. The results on the engineering case studies demonstrate the proposed algorithm's capability to optimize real-world problems. Full article

19 pages, 551 KiB  
Article
Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering
by Timea Bezdan, Catalin Stoean, Ahmed Al Naamany, Nebojsa Bacanin, Tarik A. Rashid, Miodrag Zivkovic and K. Venkatachalam
Mathematics 2021, 9(16), 1929; https://doi.org/10.3390/math9161929 - 13 Aug 2021
Cited by 87 | Viewed by 4747
Abstract
The fast-growing Internet results in massive amounts of text data. Due to the large volume and unstructured format of text data, extracting relevant information and analyzing it become very challenging. Text document clustering is a text-mining process that partitions a set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ in content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. Addressing this issue, in this work a hybrid of a swarm intelligence algorithm and the K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on text-based documents, indicates that the proposed approach is robust and superior to other state-of-the-art methods. Full article
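The K-means component of such a hybrid alternates between assigning documents (as vectors) to their nearest centroid and recomputing each centroid as its cluster mean. A plain sketch on small numeric vectors; in the paper, the fruit-fly optimizer supplies better centroids than the random initialization used here:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's K-means: assign each vector to its nearest centroid
    # (squared Euclidean distance), then move each centroid to the mean
    # of its assigned cluster.
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids
```

K-means converges only to a local optimum that depends heavily on the initial centroids, which is precisely the weakness a metaheuristic initializer targets.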
