Machine Learning Approaches for Imbalanced Domains: Emerging Trends and Applications

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 1 December 2024 | Viewed by 10709

Special Issue Editors

Guest Editor
Department of Mathematics and Computer Science, University of Cagliari, 09124 Cagliari, Italy
Interests: data mining and machine learning; high-dimensional data analysis; feature selection

Special Issue Information

Dear Colleagues,

In many real-world domains, the data distribution is highly imbalanced since instances of some classes appear much more frequently than others. This poses a difficulty for machine learning algorithms as they tend to be biased towards the majority class. At the same time, the minority class is typically the most important from a data mining perspective as it may carry valuable knowledge.
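To make the majority-class bias concrete, here is a minimal sketch in plain Python. The 95/5 class split and the helper names `accuracy` and `balanced_accuracy` are illustrative assumptions, not taken from any paper in this issue. A classifier that always predicts the majority class scores well on plain accuracy while never detecting a single minority instance.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 95 majority instances (0), 5 minority instances (1).
y_true = [0] * 95 + [1] * 5

# A trivial "classifier" that always outputs the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Balanced accuracy averages recall over classes, so the missed minority class pulls the score down to chance level even though plain accuracy looks high.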

Despite more than two decades of continuous research, several open issues remain in the field of imbalance learning. Recent trends increasingly focus on the interaction between class imbalance and other difficulties embedded in the nature of the data, such as fast-growing data volume and dimensionality, the variability of concepts over time, and the presence of noise and data quality issues. New real-world problems continue to emerge, motivating researchers to develop advanced learning strategies, at both the data level and the algorithm level, to deal effectively with imbalanced datasets.
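As a rough illustration of the two families of strategies named above, the following sketch shows one simple data-level technique (random oversampling of the smaller classes) and one simple algorithm-level technique (inverse-frequency class weights that a cost-sensitive learner could apply to its loss). The function names, the toy data, and the weighting formula n / (k * count) are assumptions chosen for illustration; they are not the methods of any specific paper listed here.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Data-level: duplicate randomly chosen instances of the smaller
    classes until every class matches the largest class in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Algorithm-level: per-class weights n / (k * count), so that errors
    on rare classes can be penalized more heavily during training."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical toy dataset: 90 majority and 10 minority instances.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_res, y_res = random_oversample(X, y)
print(Counter(y_res)[0], Counter(y_res)[1])  # 90 90

weights = inverse_frequency_weights(y)
print(weights[1])  # 5.0
```

Oversampling changes the training set and leaves the learner untouched, while class weights leave the data untouched and change how the learner scores its mistakes. Hybrid approaches combine the two.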

The aim of this Special Issue is to bring together contributions that discuss problems and solutions in this area, from both a methodological and an application-oriented perspective. Topics of interest include, but are not limited to:

  • Data-level, algorithm-level, and hybrid approaches;
  • Machine learning, ensemble learning, and deep learning methods;
  • Multi-label and multi-class imbalanced learning;
  • Learning strategies for high-dimensional imbalanced data;
  • Learning strategies for imbalanced data streams;
  • Learning strategies for imbalanced visual data;
  • Noise robustness of learning methods in imbalanced settings;
  • Metrics and methodologies for model evaluation in imbalanced settings;
  • Real-world applications: industrial monitoring systems, fraud detection, intrusion detection, software defect prediction, medical diagnosis, object detection and image classification, computer vision, text mining, sentiment analysis, anomaly detection, and behavior analysis in social media.

Dr. Barbara Pes
Dr. Andrea Loddo
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining and knowledge discovery
  • machine learning
  • deep learning
  • imbalance learning
  • case studies and real-world applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)

Research

26 pages, 590 KiB  
Article
SINNER: A Reward-Sensitive Algorithm for Imbalanced Malware Classification Using Neural Networks with Experience Replay
by Antonio Coscia, Andrea Iannacone, Antonio Maci and Alessandro Stamerra
Information 2024, 15(8), 425; https://doi.org/10.3390/info15080425 - 23 Jul 2024
Viewed by 1177
Abstract
Reports produced by popular malware analysis services showed a disparity in samples available for different malware families. The unequal distribution between such classes can be attributed to several factors, such as technological advances and the application domain that seeks to infect a computer virus. Recent studies have demonstrated the effectiveness of deep learning (DL) algorithms when learning multi-class classification tasks using imbalanced datasets. This can be achieved by updating the learning function such that correct and incorrect predictions performed on the minority class are more rewarded or penalized, respectively. This procedure can be logically implemented by leveraging the deep reinforcement learning (DRL) paradigm through a proper formulation of the Markov decision process (MDP). This paper proposes SINNER, i.e., a DRL-based multi-class classifier that approaches the data imbalance problem at the algorithmic level by exploiting a redesigned reward function, which modifies the traditional MDP model used to learn this task. Based on the experimental results, the proposed formula appears to be successful. In addition, SINNER has been compared to several DL-based models that can handle class skew without relying on data-level techniques. Using three out of four datasets sourced from the existing literature, the proposed model achieved state-of-the-art classification performance.

19 pages, 25362 KiB  
Article
An Anomaly Detection Approach to Determine Optimal Cutting Time in Cheese Formation
by Andrea Loddo, Davide Ghiani, Alessandra Perniciano, Luca Zedda, Barbara Pes and Cecilia Di Ruberto
Information 2024, 15(6), 360; https://doi.org/10.3390/info15060360 - 18 Jun 2024
Viewed by 1147
Abstract
The production of cheese, a beloved culinary delight worldwide, faces challenges in maintaining consistent product quality and operational efficiency. One crucial stage in this process is determining the precise cutting time during curd formation, which significantly impacts the quality of the cheese. Misjudging this timing can lead to the production of inferior products, harming a company’s reputation and revenue. Conventional methods often fall short of accurately assessing variations in coagulation conditions due to the inherent potential for human error. To address this issue, we propose an anomaly-detection-based approach. In this approach, we treat the class representing curd formation as the anomaly to be identified. Our proposed solution involves utilizing a one-class, fully convolutional data description network, which we compared against several state-of-the-art methods to detect deviations from the standard coagulation patterns. Encouragingly, our results show F1 scores of up to 0.92, indicating the effectiveness of our approach.

15 pages, 1172 KiB  
Article
Prediction of Disk Failure Based on Classification Intensity Resampling
by Sheng Wu and Jihong Guan
Information 2024, 15(6), 322; https://doi.org/10.3390/info15060322 - 31 May 2024
Viewed by 615
Abstract
With the rapid growth of the data scale in data centers, the high reliability of storage is facing various challenges. Specifically, hardware failures such as disk faults occur frequently, causing serious system availability issues. In this context, hardware fault prediction based on AI and big data technologies has become a research hotspot, aiming to guide operation and maintenance personnel to implement preventive replacement through accurate prediction to reduce hardware failure rates. However, existing methods still have weaknesses in terms of accuracy due to the impacts of data quality issues such as the sample imbalance. This article proposes a disk fault prediction method based on classification intensity resampling, which fills the gap between the degree of data imbalance and the actual classification intensity of the task by introducing a base classifier to calculate the classification intensity, thus better preserving the data features of the original dataset. In addition, using ensemble learning methods such as random forests, combined with resampling, an integrated classifier for imbalanced data is developed to further improve the prediction accuracy. Experimental verification shows that compared with traditional methods, the F1-score of disk fault prediction is improved by 6%, and the model training time is also greatly reduced. The fault prediction method proposed in this paper has been applied to approximately 80 disk drives and nearly 40,000 disks in the production environment of a large bank’s data center to guide preventive replacements. Compared to traditional methods, the number of preventive replacements based on our method has decreased by approximately 21%, while the overall disk failure rate remains unchanged, thus demonstrating the effectiveness of our method.

13 pages, 1512 KiB  
Article
A Framework Model of Mining Potential Public Opinion Events Pertaining to Suspected Research Integrity Issues with the Text Convolutional Neural Network model and a Mixed Event Extractor
by Zongfeng Zou, Xiaochen Ji and Yingying Li
Information 2024, 15(6), 303; https://doi.org/10.3390/info15060303 - 24 May 2024
Viewed by 689
Abstract
With the development of the Internet, the oversight of research integrity issues has extended beyond the scientific community to encompass the whole of society. If these issues are not addressed promptly, they can significantly impact the research credibility of both institutions and scholars. This article proposes a text convolutional neural network based on SMOTE to identify short texts of potential public opinion events related to suspected scientific integrity issues from common short texts. The SMOTE comprehensive sampling technique is employed to handle imbalanced datasets. To mitigate the impact of short text length on text representation quality, the Doc2vec embedding model is utilized to represent short text, yielding a one-dimensional dense vector. Additionally, the dimensions of the input layer and convolution kernel of TextCNN are adjusted. Subsequently, a short text event extraction model based on TF-IDF and TextRank is proposed to extract crucial information, for instance, names and research-related institutions, from events and facilitate the identification of potential public opinion events related to suspected scientific integrity issues. Results of experiments have demonstrated that utilizing SMOTE to balance the dataset is able to improve the classification results of TextCNN classifiers. Compared to traditional classifiers, TextCNN exhibits greater robustness in addressing the problems of imbalanced datasets. However, challenges such as low information content, non-standard writing, and polysemy in short texts may impact the accuracy of event extraction. The framework can be further optimized to address these issues in the future.

19 pages, 4471 KiB  
Article
Detection of Korean Phishing Messages Using Biased Discriminant Analysis under Extreme Class Imbalance Problem
by Siyoon Kim, Jeongmin Park, Hyun Ahn and Yonggeol Lee
Information 2024, 15(5), 265; https://doi.org/10.3390/info15050265 - 7 May 2024
Viewed by 1620
Abstract
In South Korea, the rapid proliferation of smartphones has led to an uptick in messenger phishing attacks associated with electronic communication financial scams. In response to this, various phishing detection algorithms have been proposed. However, collecting messenger phishing data poses challenges due to concerns about its potential use in criminal activities. Consequently, a Korean phishing dataset can be composed of imbalanced data, where the number of general messages might outnumber the phishing ones. This class imbalance problem and data scarcity can lead to overfitting issues, making it difficult to achieve high performance. To solve this problem, this paper proposes a phishing messages classification method using Biased Discriminant Analysis without resorting to data augmentation techniques. In this paper, by optimizing the parameters for BDA, we achieved exceptionally high performances in the phishing messages classification experiment, with 95.45% for Recall and 96.85% for the BA metric. Moreover, when compared with other algorithms, the proposed method demonstrated robustness against overfitting due to the class imbalance problem and exhibited minimal performance disparity between training and testing datasets.

20 pages, 541 KiB  
Article
An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data
by Ismael Ramos-Pérez, José Antonio Barbero-Aparicio, Antonio Canepa-Oneto, Álvar Arnaiz-González and Jesús Maudes-Raedo
Information 2024, 15(4), 223; https://doi.org/10.3390/info15040223 - 16 Apr 2024
Cited by 1 | Viewed by 1693
Abstract
The most common preprocessing techniques used to deal with datasets having high dimensionality and a low number of instances—or wide data—are feature reduction (FR), feature selection (FS), and resampling. This study explores the use of FR and resampling techniques, expanding the limited comparisons between FR and filter FS methods in the existing literature, especially in the context of wide data. We compare the optimal outcomes from a previous comprehensive study of FS against new experiments conducted using FR methods. Two specific challenges associated with the use of FR are outlined in detail: finding FR methods that are compatible with wide data and the need for a reduction estimator of nonlinear approaches to process out-of-sample data. The experimental study compares 17 techniques, including supervised, unsupervised, linear, and nonlinear approaches, using 7 resampling strategies and 5 classifiers. The results demonstrate which configurations are optimal, according to their performance and computation time. Moreover, the best configuration—namely, k Nearest Neighbor (KNN) + the Maximal Margin Criterion (MMC) feature reducer with no resampling—is shown to outperform state-of-the-art algorithms. Full article

19 pages, 2175 KiB  
Article
An Evaluation of Feature Selection Robustness on Class Noisy Data
by Simone Pau, Alessandra Perniciano, Barbara Pes and Dario Rubattu
Information 2023, 14(8), 438; https://doi.org/10.3390/info14080438 - 3 Aug 2023
Cited by 1 | Viewed by 2060
Abstract
With the increasing growth of data dimensionality, feature selection has become a crucial step in a variety of machine learning and data mining applications. In fact, it allows identifying the most important attributes of the task at hand, improving the efficiency, interpretability, and final performance of the induced models. In recent literature, several studies have examined the strengths and weaknesses of the available feature selection methods from different points of view. Still, little work has been performed to investigate how sensitive they are to the presence of noisy instances in the input data. This is the specific field in which our work wants to make a contribution. Indeed, since noise is arguably inevitable in several application scenarios, it would be important to understand the extent to which the different selection heuristics can be affected by noise, in particular class noise (which is more harmful in supervised learning tasks). Such an evaluation may be especially important in the context of class-imbalanced problems, where any perturbation in the set of training records can strongly affect the final selection outcome. In this regard, we provide here a two-fold contribution by presenting (i) a general methodology to evaluate feature selection robustness on class noisy data and (ii) an experimental study that involves different selection methods, both univariate and multivariate. The experiments have been conducted on eight high-dimensional datasets chosen to be representative of different real-world domains, with interesting insights into the intrinsic degree of robustness of the considered selection approaches.