Algorithms

Research

27 pages, 1314 KiB

Open Access Article

Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics

by Leonidas Theodorakopoulos, Aristeidis Karras and George A. Krimpas

Algorithms 2025, 18(2), 74; https://doi.org/10.3390/a18020074 (registering DOI) - 1 Feb 2025

Viewed by 173

In this study, we analyze the performance of the machine learning operators in Apache Spark MLlib for K-Means, Random Forest Regression, and Word2Vec. Using a multi-node Spark cluster, we collected detailed execution metrics across diverse datasets and parameter settings. These data were used to train predictive models that achieved up to 98% accuracy in forecasting performance. By building actionable predictive models, our research addresses key challenges in hyperparameter tuning, scalability, and real-time resource allocation. Specifically, we show the practical value of traditional models in optimizing Apache Spark MLlib workflows, achieving up to 30% resource savings and a 25% reduction in processing time. These models enable system optimization, reduce computational overhead, and boost the overall performance of big data applications. Ultimately, this work not only closes significant gaps in predictive performance modeling but also paves the way for real-time analytics in distributed environments.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

26 pages, 4970 KiB

Open Access Article

Do What You Say—Computing Personal Values Associated with Professions Based on the Words They Use

by Aditya Jha and Peter A. Gloor

Algorithms 2025, 18(2), 72; https://doi.org/10.3390/a18020072 (registering DOI) - 1 Feb 2025

Viewed by 306

Abstract

Members of a profession frequently show similar personality characteristics. In this research, we leverage recent advances in NLP to compute personal values using a moral values framework, distinguishing between four personas that help categorize professions by personal values: “fatherlanders” (valuing tradition and authority), “nerds” (valuing scientific achievements), “spiritualists” (valuing compassion and non-monetary achievements), and “treehuggers” (valuing sustainability and the environment). We collected 200 YouTube videos and podcasts for each professional category of lawyers, academics, athletes, engineers, creatives, managers, and accountants, converting their audio to text. Using pre-trained models, we also categorize these professions by team player personas: “bees” (collaborative creative team players), “ants” (competitive hard workers), and “leeches” (selfish egoists). We find distinctive personal value profiles for each of our seven professions, computed from the words that members of each profession use.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

18 pages, 7563 KiB

Open Access Article

Quantitative Analysis Using PMOD and FreeSurfer for Three Types of Radiopharmaceuticals for Alzheimer’s Disease Diagnosis

by Hyun Jin Yoon, Daye Yoon, Sungmin Jun, Young Jin Jeong and Do-Young Kang

Algorithms 2025, 18(2), 57; https://doi.org/10.3390/a18020057 - 21 Jan 2025

Viewed by 454

Abstract

In amyloid brain PET, SUVr values can be extracted and compared after parcellation with FreeSurfer, a finite element method (FEM)-based algorithm, and PMOD, a voxel-based algorithm. This study presents SUVr classification thresholds for PET images of F-18 florbetaben (FBB), F-18 flutemetamol (FMM), and F-18 florapronol (FPN), and compares classification performance by computational algorithm in each brain region. PET images were co-registered after the generated MRI was registered with standard template information. Using a MATLAB script, SUVr was calculated from the built-in parcellation numbers that label each brain region. PMOD and FreeSurfer, which use different algorithms, were used to load the PET image, which was registered to the MRI and then normalized to the MRI template. The volume and SUVr of each gray matter region were calculated using an automated anatomical labeling atlas. SUVr values were calculated for eight regions: the frontal cortex (FC), lateral temporal cortex (LTC), mesial temporal cortex (MTC), parietal cortex (PC), occipital cortex (OC), anterior and posterior cingulate cortex (GCA, GCP), and a composite region. The correlation between FreeSurfer and PMOD SUVr values was calculated, classification ability was assessed with the AUC for amyloid-positive and amyloid-negative subjects, and the SUVr threshold was determined using the Youden index. The correlation coefficients between the FreeSurfer and PMOD SUVr calculations over the eight cortical regions were 0.95 for FBB, 0.94 for FMM, and 0.91 for FPN. The SUVr threshold was SUVr(LTC,min) = 1.264 and SUVr(THA,max) = 1.725 when calculated using FPN-FreeSurfer, and SUVr(MTC,min) = 1.093 and SUVr(MCT,max) = 1.564 when calculated using FPN-PMOD. The AUC comparison showed no statistically significant difference (p > 0.05) in the SUVr classification results among the three radiopharmaceuticals, specifically for the LTC and OC regions in the PMOD analysis and the LTC and PC regions in the FreeSurfer analysis. The SUVr calculation using PMOD (voxel-based algorithm) correlates strongly with the calculation using FreeSurfer (FEM-based algorithm); therefore, they complement each other. Quantitative classification analysis with high accuracy is possible using the suggested SUVr threshold. Classification performance ranked in the order FMM, FBB, and FPN, and was good in the LTC region regardless of the type of radiotracer and analysis algorithm.
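The Youden index mentioned in this abstract selects the cut-off that maximizes J = sensitivity + specificity - 1. The following is a minimal sketch of that selection in plain Python, not the authors' MATLAB pipeline; the function name `youden_threshold` and the SUVr values and labels are hypothetical and chosen only for illustration.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is classified as positive when its score >= threshold.
    labels use 1 for positive and 0 for negative.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_threshold, best_j = None, -1.0
    # Every observed score is a candidate cut-off.
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical SUVr values and amyloid status (illustrative, not study data).
suvr = [1.05, 1.10, 1.18, 1.22, 1.30, 1.41, 1.55, 1.62]
amyloid_positive = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, j = youden_threshold(suvr, amyloid_positive)
```

On this toy data the sketch picks the lowest cut-off that reaches the maximum J; how ties are broken is a design choice that the abstract does not specify.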

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

20 pages, 254 KiB

Open Access Article

Code Obfuscation: A Comprehensive Approach to Detection, Classification, and Ethical Challenges

by Tomer Raitsis, Yossi Elgazari, Guy E. Toibin, Yotam Lurie, Shlomo Mark and Oded Margalit

Algorithms 2025, 18(2), 54; https://doi.org/10.3390/a18020054 - 21 Jan 2025

Viewed by 498

Abstract

Code obfuscation has become an essential practice in modern software development, designed to make source or machine code challenging for both humans and computers to comprehend. It plays a crucial role in cybersecurity by protecting intellectual property, safeguarding trade secrets, and preventing unauthorized access or reverse engineering. However, the lack of transparency in obfuscated code raises significant ethical concerns, including the potential for harmful or unethical uses such as hidden data collection, malicious features, back doors, and concealed vulnerabilities. These issues highlight the need for a balanced approach that ensures the protection of developers’ intellectual property while addressing ethical responsibilities related to user privacy, transparency, and societal impact. This paper investigates various code obfuscation techniques, their benefits, challenges, and practical applications, underscoring their relevance in contemporary software development. This study examines obfuscation methods and tools, evaluates machine learning models—including Random Forest, Gradient Boosting, and Support Vector Machine—and presents experimental results aimed at classifying obfuscated versus non-obfuscated files. Our findings demonstrate that these models achieve high accuracy in identifying obfuscation methods employed by tools such as Jlaive, Oxyry, PyObfuscate, Pyarmor, and py-obfuscator. This research also addresses emerging ethical concerns and proposes guidelines for a balanced, responsible approach to code obfuscation.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

16 pages, 239 KiB

Open Access Article

SMOTE vs. SMOTEENN: A Study on the Performance of Resampling Algorithms for Addressing Class Imbalance in Regression Models

by Gazi Husain, Daniel Nasef, Rejath Jose, Jonathan Mayer, Molly Bekbolatova, Timothy Devine and Milan Toma

Algorithms 2025, 18(1), 37; https://doi.org/10.3390/a18010037 - 10 Jan 2025

Viewed by 605

Abstract

Class imbalance is a prevalent challenge in machine learning that arises when the data distribution is skewed toward one class over another, causing models to prioritize the majority class and underperform on minority classes. This bias can significantly undermine accurate predictions in real-world scenarios, highlighting the importance of the robust handling of imbalanced data for dependable results. This study examines one such scenario: real-time monitoring systems for fall risk assessment in bedridden patients, where class imbalance may compromise the effectiveness of machine learning. It compares the effectiveness of two resampling techniques, the Synthetic Minority Oversampling Technique (SMOTE) and SMOTE combined with Edited Nearest Neighbors (SMOTEENN), in mitigating class imbalance and improving predictive performance. Using a controlled sampling strategy across various instance levels, the performance of both methods in conjunction with decision tree regression, gradient boosting regression, and Bayesian regression models was evaluated. The results indicate that SMOTEENN consistently outperforms SMOTE in terms of accuracy and mean squared error across all sample sizes and models. SMOTEENN also demonstrates healthier learning curves, suggesting improved generalization capabilities, particularly for a sampling strategy with a given number of instances. Furthermore, cross-validation analysis reveals that SMOTEENN achieves higher mean accuracy and lower standard deviation compared to SMOTE, indicating more stable and reliable performance. These findings suggest that SMOTEENN is a more effective technique for handling class imbalance, potentially contributing to the development of more accurate and generalizable predictive models in various applications.
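The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in plain Python; it is not the implementation used in the study, and the function name `smote_sketch`, the parameter defaults, and the sample points are assumptions made for illustration. SMOTEENN additionally removes samples that disagree with their nearest neighbours (Edited Nearest Neighbors), which is not shown here.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority samples (illustrative only).
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
synthetic = smote_sketch(minority_points, n_new=5)
```

In practice, a maintained library such as imbalanced-learn, which provides `SMOTE` and `SMOTEENN` classes, would normally be preferred over a hand-written version.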

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

14 pages, 392 KiB

Open Access Article

Applying Recommender Systems to Predict Personalized Film Age Ratings for Parents

by Harris Papadakis, Paraskevi Fragopoulou and Costas Panagiotakis

Algorithms 2024, 17(12), 578; https://doi.org/10.3390/a17120578 - 14 Dec 2024

Viewed by 535

Abstract

A motion picture content rating system categorizes a film based on its appropriateness for various audiences, considering factors such as portrayals of sex, violence, substance abuse, profanity, and other elements typically considered unsuitable for children or adolescents. This rating is usually coupled with a minimum age for which the film is suitable. In this work, we apply recommender systems to predict personalized film age ratings for parents. In the proposed methodology, we reduce the personalized film age prediction problem to the classic item recommendation problem by applying a recommender system for each film age category. The recommender systems generate recommendations for each film age category, and these recommendations are then combined to provide the final age recommendation for the parent (user). The proposed methodology was applied to state-of-the-art recommender systems. In addition, we used these systems as baselines by applying them directly to the age prediction problem, treating each film as an item and assigning the given age as its rating. The experimental results highlight the efficiency of the proposed system when applied to a well-known real-world dataset.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

16 pages, 361 KiB

Open Access Article

Stroke Dataset Modeling: Comparative Study of Machine Learning Classification Methods

by Kalina Kitova, Ivan Ivanov and Vincent Hooper

Algorithms 2024, 17(12), 571; https://doi.org/10.3390/a17120571 - 13 Dec 2024

Cited by 1 | Viewed by 838

Abstract

Stroke prediction is a vital research area due to its significant implications for public health. This comparative study offers a detailed evaluation of algorithmic methodologies and outcomes from three recent prominent studies on stroke prediction. Ivanov et al. tackled issues of imbalanced datasets and algorithmic bias using deep learning techniques, achieving notable results with 98% accuracy and a 97% recall rate. They utilized resampling methods to balance the classes and advanced imputation techniques to handle missing data, underscoring the critical role of data preprocessing in enhancing the performance of Support Vector Machines (SVMs). Hassan et al. addressed missing data and class imbalance using multiple imputations and the Synthetic Minority Oversampling Technique (SMOTE). They developed a Dense Stacking Ensemble (DSE) model with over 96% accuracy. Their results underscore the efficiency of ensemble learning techniques and imputation for handling imbalanced datasets in stroke prediction. Bathla et al. employed various classifiers and feature selection techniques, including SMOTE, for class balancing. Their Random Forest (RF) classifier, combined with Feature Importance (FI) selection, achieved an accuracy of 97.17%, illustrating the positive impact of RF and relevant feature selection on model performance. A comparative analysis indicated that Ivanov et al.’s method achieved the highest accuracy rate. However, the studies collectively highlight that the choice of models and techniques for stroke prediction should be tailored to the specific characteristics of the dataset used. This study emphasizes the importance of effective data management and model selection in enhancing predictive performance.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

21 pages, 2687 KiB

Open Access Article

A Random PRIM Based Algorithm for Interpretable Classification and Advanced Subgroup Discovery

by Rym Nassih and Abdelaziz Berrado

Algorithms 2024, 17(12), 565; https://doi.org/10.3390/a17120565 - 10 Dec 2024

Viewed by 595

Abstract

Machine-learning algorithms have made significant strides, achieving high accuracy in many applications. However, traditional models often need large datasets, as they typically peel substantial portions of the data in each iteration, complicating the development of a classifier without sufficient data. In critical fields like healthcare, there is a growing need to identify and analyze small yet significant subgroups within data. To address these challenges, we introduce a novel classifier based on the patient rule-induction method (PRIM), a subgroup-discovery algorithm. PRIM finds rules by peeling minimal data at each iteration, enabling the discovery of highly relevant regions. Unlike traditional classifiers, PRIM requires experts to select input spaces manually. Our innovation transforms PRIM into an interpretable classifier by starting with random input space selections for each class, then pruning rules using metarules, and finally selecting definitive rules for the classifier. Tested against popular algorithms such as random forest, logistic regression, and XGBoost, our random PRIM-based classifier (R-PRIM-Cl) demonstrates comparable robustness, superior interpretability, and the ability to handle categorical and numeric variables. It discovers more rules in certain datasets, making it especially valuable in fields where understanding the model’s decision-making process is as important as its predictive accuracy.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

16 pages, 2729 KiB

Open Access Article

Hybrid RFSVM: Hybridization of SVM and Random Forest Models for Detection of Fake News

by Deepali Goyal Dev and Vishal Bhatnagar

Algorithms 2024, 17(10), 459; https://doi.org/10.3390/a17100459 - 16 Oct 2024

Cited by 1 | Viewed by 1106

Abstract

Fake information can be created and spread very easily through the internet community. This pervasive escalation of fake news and rumors has an extremely adverse effect on the nation and society. Detecting fake news on the social web is an emerging research topic. In this research, the authors review various characteristics of fake news and identify research gaps. The fake news dataset is tokenized and modeled by applying term frequency-inverse document frequency (TF-IDF) weighting. Several machine-learning classification approaches are used to compute evaluation metrics. The authors propose hybridizing SVM and RF classification algorithms for improved accuracy, precision, recall, and F1-score. The authors also present a comparative analysis of different types of news categories using various machine-learning models and compare the performance of the hybrid RFSVM. Comparative studies of the hybrid RFSVM against algorithms such as Random Forest (RF), naïve Bayes (NB), SVMs, and XGBoost show improvements of around 8% to 16% in accuracy, precision, recall, and F1-score.
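The term frequency and inverse document frequency weighting mentioned in this abstract can be sketched in a few lines of plain Python. This is an illustrative sketch under simple assumptions (whitespace tokenization, non-empty documents, unsmoothed natural-log IDF), not the authors' preprocessing; the function name `tf_idf` and the example headlines are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical headlines (illustrative only).
headlines = ["aliens endorse local mayor", "mayor opens local library"]
weights = tf_idf(headlines)
```

Terms that occur in every document, such as "mayor" and "local" here, receive a weight of zero, while terms unique to one headline are weighted up. Library implementations often use smoothed IDF variants, so their values may differ from this sketch.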

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

20 pages, 5263 KiB

Open Access Article

Correlation Analysis of Railway Track Alignment and Ballast Stiffness: Comparing Frequency-Based and Machine Learning Algorithms

by Saeed Mohammadzadeh, Hamidreza Heydari, Mahdi Karimi and Araliya Mosleh

Algorithms 2024, 17(8), 372; https://doi.org/10.3390/a17080372 - 22 Aug 2024

Viewed by 1237

Abstract

One of the primary challenges in the railway industry revolves around achieving a comprehensive and insightful understanding of track conditions. The geometric parameters and stiffness of railway tracks play a crucial role in condition monitoring as well as maintenance work. Hence, this study investigated the relationship between vertical ballast stiffness and the track longitudinal level. Initially, the ballast stiffness and track longitudinal level data were acquired through a series of experimental measurements conducted on a reference test track along the Tehran–Mashhad railway line, utilizing recording cars for geometric track and stiffness recordings. Subsequently, the correlation between the track longitudinal level and ballast stiffness was surveyed using both frequency-based techniques and machine learning (ML) algorithms. The power spectral density (PSD) was employed as a frequency-based technique, alongside ML algorithms, including linear regression, decision trees, and random forests, for correlation mining analyses. The results showed a robust and statistically significant relationship between the vertical ballast stiffness and longitudinal levels of railway tracks. Specifically, the PSD data exhibited a considerable correlation, especially within the 1–4 rad/m wave number range. Furthermore, the data analyses conducted using ML methods indicated that the values of the root mean square error (RMSE) were about 0.05, 0.07, and 0.06 for the linear regression, decision tree, and random forest algorithms, respectively, demonstrating the adequate accuracy of ML-based approaches.
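For reference, the root mean square error (RMSE) reported in this abstract is the square root of the mean squared difference between measured and predicted values. A minimal sketch follows; the function name `rmse` and the sample values are hypothetical and are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical measured vs. predicted values (illustrative only).
measured = [1.0, 2.0, 3.0]
predicted_values = [1.0, 2.0, 5.0]
error = rmse(measured, predicted_values)
```

RMSE is expressed in the same units as the target variable, so values such as those quoted in the abstract are only meaningful relative to the scale of the measured quantity.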

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

21 pages, 597 KiB

Open Access Article

MVACLNet: A Multimodal Virtual Augmentation Contrastive Learning Network for Rumor Detection

by Xin Liu, Mingjiang Pang, Qiang Li, Jiehan Zhou, Haiwen Wang and Dawei Yang

Algorithms 2024, 17(5), 199; https://doi.org/10.3390/a17050199 - 8 May 2024

Cited by 1 | Viewed by 1568

Abstract

In today’s digital era, rumors spreading on social media threaten societal stability and individuals’ daily lives, especially multimodal rumors. Hence, there is an urgent need for effective multimodal rumor detection methods. However, existing approaches often overlook the insufficient diversity of multimodal samples in feature space and hidden similarities and differences among multimodal samples. To address such challenges, we propose MVACLNet, a Multimodal Virtual Augmentation Contrastive Learning Network. In MVACLNet, we first design a Hierarchical Textual Feature Extraction (HTFE) module to extract comprehensive textual features from multiple perspectives. Then, we fuse the textual and visual features using a modified cross-attention mechanism, which operates from different perspectives at the feature value level, to obtain authentic multimodal feature representations. Following this, we devise a Virtual Augmentation Contrastive Learning (VACL) module as an auxiliary training module. It leverages ground-truth labels and extra-generated virtual multimodal feature representations to enhance contrastive learning, thus helping capture more crucial similarities and differences among multimodal samples. Meanwhile, it performs a Kullback–Leibler (KL) divergence constraint between predicted probability distributions of the virtual multimodal feature representations and their corresponding virtual labels to help extract more content-invariant multimodal features. Finally, the authentic multimodal feature representations are input into a rumor classifier for detection. Experiments on two real-world datasets demonstrate the effectiveness and superiority of MVACLNet on multimodal rumor detection.
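The Kullback–Leibler (KL) divergence used as a constraint in this abstract measures how one discrete probability distribution differs from another. The sketch below computes D_KL(P || Q) in plain Python for illustration only; the direction of the divergence and its weighting in the MVACLNet loss are not stated in the abstract, and the function name `kl_divergence` and the example distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats.

    p and q are discrete distributions over the same outcomes.
    Terms with p_i == 0 contribute 0; if q_i == 0 while p_i > 0
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical virtual label distribution vs. predicted distribution
# for a two-class (rumor / non-rumor) problem; illustrative only.
virtual_label = [1.0, 0.0]
predicted_distribution = [0.5, 0.5]
divergence = kl_divergence(virtual_label, predicted_distribution)
```

The divergence is zero only when the two distributions are identical, and it is not symmetric, so swapping the arguments generally changes the value.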

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

18 pages, 1892 KiB

Open Access Article

Research on Efficient Feature Generation and Spatial Aggregation for Remote Sensing Semantic Segmentation

by Ruoyang Li, Shuping Xiong, Yinchao Che, Lei Shi, Xinming Ma and Lei Xi

Algorithms 2024, 17(4), 151; https://doi.org/10.3390/a17040151 - 4 Apr 2024

Viewed by 1783

Abstract

Semantic segmentation algorithms leveraging deep convolutional neural networks often encounter challenges due to their extensive parameters, high computational complexity, and slow execution. To address these issues, we introduce a semantic segmentation network model emphasizing the rapid generation of redundant features and multi-level spatial aggregation. This model applies cost-efficient linear transformations instead of standard convolution operations during feature map generation, effectively managing memory usage and reducing computational complexity. To enhance the feature maps’ representation ability post-linear transformation, a specifically designed dual-attention mechanism is implemented, enhancing the model’s capacity for semantic understanding of both local and global image information. Moreover, the model integrates sparse self-attention with multi-scale contextual strategies, effectively combining features across different scales and spatial extents. This approach optimizes computational efficiency and retains crucial information, enabling precise and quick image segmentation. To assess the model’s segmentation performance, we conducted experiments in Changge City, Henan Province, using datasets such as LoveDA, PASCAL VOC, LandCoverNet, and DroneDeploy. These experiments demonstrated the model’s outstanding performance on public remote sensing datasets, significantly reducing the parameter count and computational complexity while maintaining high accuracy in segmentation tasks. This advancement offers substantial technical benefits for applications in agriculture and forestry, including land cover classification and crop health monitoring, thereby underscoring the model’s potential to support these critical sectors effectively.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

11 pages, 374 KiB

Open Access Communication

Numerical Algorithms in III–V Semiconductor Heterostructures

by Ioannis G. Tsoulos and V. N. Stavrou

Algorithms 2024, 17(1), 44; https://doi.org/10.3390/a17010044 - 19 Jan 2024

Viewed by 1824

Abstract

In the current research, we consider the solution of dispersion relations arising in solid state physics by using artificial neural networks (ANNs). More specifically, in a double semiconductor heterostructure, we theoretically investigate the dispersion relations of the interface polariton (IP) modes and describe the reststrahlen frequency bands between the frequencies of the transverse and longitudinal optical phonons. The numerical results obtained by the aforementioned methods are in agreement with results reported in the recently published literature. Two methods were used to train the neural network: a hybrid genetic algorithm and a modified version of the well-known particle swarm optimization method.

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

Published Papers (13 papers)
