Predictive Analytics and Data Science

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Processes".

Deadline for manuscript submissions: closed (20 April 2023) | Viewed by 99597

Special Issue Editors


Guest Editor
Department of Computer Science and Systems Technology, University of Pannonia, Veszprém, Hungary
Interests: machine learning; data mining; predictive analytics; artificial intelligence; neural networks; network analysis; data science; healthcare data mining

Guest Editor
MTA-PE Lendület Complex Systems Monitoring Research Group, Department of Process Engineering, University of Pannonia, H-8200 Veszprém, Hungary
Interests: chemical engineering; complex systems; computational intelligence; network science; process engineering

Special Issue Information

Dear Colleagues,

The development and maintenance of predictive data-driven models pose several challenges, such as feature selection, model structure optimization, sensitivity analysis, model validation, model maintenance, transfer learning and adaptation, model deployment, and evaluation of the benefit of applying the models.

This Special Issue solicits papers covering the development, validation, application, and maintenance of predictive analytics models and presenting real-life applications. The potential topics include but are not limited to:

  • Classification-based prediction models;
  • Regression-based prediction models;
  • Forecasting using deep learning methods and algorithms;
  • Managing uncertainty and missing data in forecasting;
  • The life cycle of predictive models, and maintaining predictive models;
  • Development and validation of online predictive models;
  • Self-learning predictive models;
  • Predictive analytics in Industry 4.0 (application of sensors, historical experience);
  • Predictive analysis in healthcare and economy (e.g., patient pathway prediction, predicting complications, customer relationship management, risk reduction, churn prevention, market trend analysis, credit scoring);
  • Social media and text analysis-based predictive models and systems.

Dr. Agnes Vathy-Fogarassy
Prof. Dr. Janos Abonyi
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • classification
  • regression
  • deep learning
  • uncertainty
  • validation and maintenance
  • self-learning
  • real-life applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (18 papers)

Research

26 pages, 3230 KiB  
Article
Exploitation of Vulnerabilities: A Topic-Based Machine Learning Framework for Explaining and Predicting Exploitation
by Konstantinos Charmanas, Nikolaos Mittas and Lefteris Angelis
Information 2023, 14(7), 403; https://doi.org/10.3390/info14070403 - 14 Jul 2023
Cited by 1 | Viewed by 3106
Abstract
Security vulnerabilities constitute one of the most important weaknesses of hardware and software security that can cause severe damage to systems, applications, and users. As a result, software vendors should prioritize the most dangerous and impactful security vulnerabilities by developing appropriate countermeasures. As we acknowledge the importance of vulnerability prioritization, in the present study, we propose a framework that maps newly disclosed vulnerabilities with topic distributions, via word clustering, and further predicts whether this new entry will be associated with a potential exploit Proof Of Concept (POC). We also provide insights on the current most exploitable weaknesses and products through a Generalized Linear Model (GLM) that links the topic memberships of vulnerabilities with exploit indicators, thus distinguishing five topics that are associated with relatively frequent recent exploits. Our experiments show that the proposed framework can outperform two baseline topic modeling algorithms in terms of topic coherence by improving LDA models by up to 55%. In terms of classification performance, the conducted experiments—on a quite balanced dataset (57% negative observations, 43% positive observations)—indicate that the vulnerability descriptions can be used as exclusive features in assessing the exploitability of vulnerabilities, as the “best” model achieves accuracy close to 87%. Overall, our study contributes to enabling the prioritization of vulnerabilities by providing guidelines on the relations between the textual details of a weakness and the potential application/system exploits. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

13 pages, 9387 KiB  
Article
An Intelligent Boosting and Decision-Tree-Regression-Based Score Prediction (BDTR-SP) Method in the Reform of Tertiary Education Teaching
by Ling Zhu, Guangyu Liu, Shuang Lv, Dongjie Chen, Zhihong Chen and Xiang Li
Information 2023, 14(6), 317; https://doi.org/10.3390/info14060317 - 30 May 2023
Cited by 1 | Viewed by 1843
Abstract
The reform of tertiary education teaching encourages teachers to adjust their teaching plans in a timely manner based on students’ learning feedback in order to improve teaching performance. Therefore, learning score prediction is a key issue in the process of the reform of tertiary education teaching. With the development of information and management technologies, a lot of teaching data are generated as the scale of online and offline education expands. However, a teacher or educator does not have a comprehensive dataset in practice, which challenges his/her ability to predict the students’ learning performance from the individual’s viewpoint. How to overcome the drawbacks of small samples is an open issue. To this end, it is desirable that an effective artificial intelligence tool is designed to help teachers or educators predict students’ scores well. We propose a boosting and decision-tree-regression-based score prediction (BDTR-SP) model, which relies on an ensemble learning structure with base learners of decision tree regression (DTR) to improve the prediction accuracy. Experiments on small samples are conducted to examine the important features that affect students’ scores. The results show that the proposed model has advantages over its peers in terms of prediction correctness. Moreover, the predicted results are consistent with the actual facts implied in the original dataset. The proposed BDTR-SP method helps teachers and students predict students’ performance in ongoing courses in order to adjust the teaching and learning strategies, plans and practices in advance, enhancing the teaching and learning quality. Therefore, the integration of information technology and artificial intelligence into teaching and learning practices is able to push forward the reform of tertiary education teaching. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

15 pages, 1454 KiB  
Article
Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods
by Ashokkumar Palanivinayagam and Robertas Damaševičius
Information 2023, 14(2), 92; https://doi.org/10.3390/info14020092 - 3 Feb 2023
Cited by 24 | Viewed by 7769
Abstract
The existence of missing values reduces the amount of knowledge learned by the machine learning models in the training stage thus affecting the classification accuracy negatively. To address this challenge, we introduce the use of Support Vector Machine (SVM) regression for imputing the missing values. Additionally, we propose a two-level classification process to reduce the number of false classifications. Our evaluation of the proposed method was conducted using the PIMA Indian dataset for diabetes classification. We compared the performance of five different machine learning models: Naive Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Random Forest (RF), and Linear Regression (LR). The results of our experiments show that the SVM classifier achieved the highest accuracy of 94.89%. The RF classifier had the highest precision (98.80%) and the SVM classifier had the highest recall (85.48%). The NB model had the highest F1-Score (95.59%). Our proposed method provides a promising solution for detecting diabetes at an early stage by addressing the issue of missing values in the dataset. Our results show that the use of SVM regression and a two-level classification process can notably improve the performance of machine learning models for diabetes classification. This work provides a valuable contribution to the field of diabetes research and highlights the importance of addressing missing values in machine learning applications. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

15 pages, 476 KiB  
Article
A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining
by Tarid Wongvorachan, Surina He and Okan Bulut
Information 2023, 14(1), 54; https://doi.org/10.3390/info14010054 - 16 Jan 2023
Cited by 92 | Viewed by 25549
Abstract
Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

25 pages, 4390 KiB  
Article
Spot Welding Parameter Tuning for Weld Defect Prevention in Automotive Production Lines: An ML-Based Approach
by Musa Bayır, Ertuğrul Yücel, Tolga Kaya and Nihan Yıldırım
Information 2023, 14(1), 50; https://doi.org/10.3390/info14010050 - 13 Jan 2023
Cited by 2 | Viewed by 3602
Abstract
Spot welding is a critical joining process which presents specific challenges in early defect detection, has high rework costs, and consumes excessive amounts of materials, hindering effective, sustainable production. Especially in automotive manufacturing, the welding source’s quality needs to be controlled to increase the efficiency and sustainable performance of the production lines. Using data analytics, manufacturing companies can control and predict the welding parameters causing problems related to resource quality and process performance. In this study, we aimed to define the root cause of welding defects and solve the welding input value range problem using machine learning algorithms. In an automotive production line application, we analyzed real-time IoT data and created variables regarding the best working range of welding input parameters required in the inference analysis for expulsion reduction. The results will help to provide guidelines and parameter selection approaches to model ML-based solutions for the optimization problems associated with welding. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

11 pages, 1353 KiB  
Article
Research on Apparel Retail Sales Forecasting Based on xDeepFM-LSTM Combined Forecasting Model
by Tian Luo, Daofang Chang and Zhenyu Xu
Information 2022, 13(10), 497; https://doi.org/10.3390/info13100497 - 15 Oct 2022
Cited by 7 | Viewed by 2688
Abstract
Accurate sales forecasting can provide a scientific basis for the management decisions of enterprises. We proposed the xDeepFM-LSTM combined forecasting model for the characteristics of sales data of apparel retail enterprises. We first used the Extreme Deep Factorization Machine (xDeepFM) model to explore the correlation between the sales influencing features as much as possible, and then modeled the sales prediction. Next, we used the Long Short-Term Memory (LSTM) model for residual correction to improve the accuracy of the prediction model. We then designed and implemented comparison experiments between the combined xDeepFM-LSTM forecasting model and other forecasting models. The experimental results show that the forecasting performance of xDeepFM-LSTM is significantly better than other forecasting models. Compared with the xDeepFM forecasting model, the combined forecasting model has a higher optimization rate, which provides a scientific basis for apparel companies to adjust their demand plans. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

14 pages, 877 KiB  
Article
Statistical Machine Learning Regression Models for Salary Prediction Featuring Economy Wide Activities and Occupations
by Yasser T. Matbouli and Suliman M. Alghamdi
Information 2022, 13(10), 495; https://doi.org/10.3390/info13100495 - 12 Oct 2022
Cited by 9 | Viewed by 7632
Abstract
A holistic occupational and economy-wide framework for salary prediction is developed and tested using statistical machine learning (ML). Predictive models are developed based on occupational features and organizational characteristics. Five different supervised ML algorithms are trained using survey data from the Saudi Arabian labor market to estimate mean annual salary across economic activities and major occupational groups. In predicting the mean salary over economic activities, the Bayesian Gaussian process regression ML showed a marked improvement in R² over multiple linear regression (from 0.50 to 0.98). Moreover, lower error levels were obtained: root-mean-square error was reduced by 80% and mean absolute error was reduced by almost 90% compared to multiple linear regression. However, in the salary prediction over major occupational groups, artificial neural networks performed the best in terms of both R², with an improvement from 0.62 in multiple linear regression to 0.94, and errors, which were reduced by approximately 60%. The proposed framework can help estimate annual salary levels across different types of economic activities and organization sizes, as well as different occupations. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

19 pages, 3029 KiB  
Article
Artificial Neural Network Training Using Structural Learning with Forgetting for Parameter Analysis of Injection Molding Quality Prediction
by Muhammad Rifqi Maarif, R. Faiz Listyanda, Yong-Shin Kang and Muhammad Syafrudin
Information 2022, 13(10), 488; https://doi.org/10.3390/info13100488 - 10 Oct 2022
Cited by 10 | Viewed by 3183
Abstract
The analysis of influential machine parameters can be useful to plan and design a plastic injection molding process. However, current research in parameter analysis is mostly based on computer-aided engineering (CAE) or simulation, which have been demonstrated to be inadequate for analyzing complex behavioral changes in the real injection molding process. More advanced approaches using machine learning technology, specifically artificial neural networks (ANNs), have brought promising results in terms of prediction accuracy. Nevertheless, the black-box nature and distributed representation of ANNs prevent humans from gaining an insight into which process parameters have a significant influence on the final prediction output. Therefore, in this paper, we develop a simpler ANN model by using structural learning with forgetting (SLF) as the algorithm for the training process. Instead of typical backpropagation, which generates a fully connected layer of the ANN model, SLF only reveals the important neurons and connections. Hence, the training process of SLF leaves only influential connections and neurons. Since each neuron in the input layer represents one of the injection molding parameters, the ANN-SLF model can be further investigated to determine the influential process parameters. By applying SLF to the ANN training process, this experiment has successfully extracted a set of significant injection molding process parameters. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

14 pages, 1639 KiB  
Article
Explainable Stacking-Based Model for Predicting Hospital Readmission for Diabetic Patients
by Haohui Lu and Shahadat Uddin
Information 2022, 13(9), 436; https://doi.org/10.3390/info13090436 - 15 Sep 2022
Cited by 9 | Viewed by 4048
Abstract
Artificial intelligence is changing the practice of healthcare. While it is essential to employ such solutions, making them transparent to medical experts is more critical. Most of the previous work presented disease prediction models, but did not explain them. Many healthcare stakeholders do not have a solid foundation in these models. Treating these models as a ‘black box’ diminishes confidence in their predictions. The development of explainable artificial intelligence (XAI) methods has enabled us to change the models into a ‘white box’. XAI allows human users to comprehend the results from machine learning algorithms by making them easy to interpret. For instance, the expenditures of healthcare services associated with unplanned readmissions are enormous. This study proposed a stacking-based model to predict 30-day hospital readmission for diabetic patients. We employed Random Under-Sampling to solve the imbalanced class issue, then utilised SelectFromModel for feature selection and constructed a stacking model with base and meta learners. Performance analysis showed that our model can better predict readmission than other existing machine learning models. This proposed model is also explainable and interpretable. Based on permutation feature importance, the strong predictors were the number of inpatients, the primary diagnosis, discharge to home with home service, and the number of emergencies. The local interpretable model-agnostic explanations method was also employed to demonstrate explainability at the individual level. The findings for the readmission of diabetic patients could be helpful in medical practice and provide valuable recommendations to stakeholders for minimising readmission and reducing public healthcare costs. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

18 pages, 346 KiB  
Article
Optimized Screening for At-Risk Students in Mathematics: A Machine Learning Approach
by Okan Bulut, Damien C. Cormier and Seyma Nur Yildirim-Erbasli
Information 2022, 13(8), 400; https://doi.org/10.3390/info13080400 - 22 Aug 2022
Cited by 1 | Viewed by 2589
Abstract
Traditional screening approaches identify students who might be at risk for academic problems based on how they perform on a single screening measure. However, using multiple screening measures may improve accuracy when identifying at-risk students. The advent of machine learning algorithms has allowed researchers to consider using advanced predictive models to identify at-risk students. The purpose of this study is to investigate if machine learning algorithms can strengthen the accuracy of predictions made from progress monitoring data to classify students as at risk for low mathematics performance. This study used a sample of first-grade students who completed a series of computerized formative assessments (Star Math, Star Reading, and Star Early Literacy) during the 2016–2017 (n = 45,478) and 2017–2018 (n = 45,501) school years. Predictive models using two machine learning algorithms (i.e., Random Forest and LogitBoost) were constructed to identify students at risk for low mathematics performance. The classification results were evaluated using evaluation metrics of accuracy, sensitivity, specificity, F1, and Matthews correlation coefficient. Across the five metrics, a multi-measure screening procedure involving mathematics, reading, and early literacy scores generally outperformed single-measure approaches relying solely on mathematics scores. These findings suggest that educators may be able to use a cluster of measures administered once at the beginning of the school year to screen their first grade for at-risk math performance. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

11 pages, 920 KiB  
Article
Improved Carpooling Experience through Improved GPS Trajectory Classification Using Machine Learning Algorithms
by Manish Kumar Pandey, Anu Saini, Karthikeyan Subbiah, Nalini Chintalapudi and Gopi Battineni
Information 2022, 13(8), 369; https://doi.org/10.3390/info13080369 - 3 Aug 2022
Cited by 2 | Viewed by 2369
Abstract
Globally, smart cities, infrastructure, and transportation have led to a rise in vehicle numbers, resulting in an increasing number of problems. This includes problems such as air pollution, noise pollution, high energy consumption, and people’s health. A viable solution to these problems is carpooling, which involves sharing vehicles between people going to the same location. As carpooling solutions become more popular, they need to be implemented efficiently. Data analytics can help people make informed decisions when selecting a ride (Car or Bus). We applied machine learning algorithms to select the desired ride (Car or Bus) and used feature ranking algorithms to identify the foremost traits for selecting the desired ride. Based on the performance evaluation metric, 11 classifiers were used for the experiment. In terms of selecting the desired ride, Random Forest performs best. Using ten-fold cross-validation, we obtained a sensitivity of 87.4%, a specificity of 73.7%, and an accuracy of 81.0%; using leave-one-out cross-validation, we obtained a sensitivity of 90.8%, a specificity of 77.6%, and an accuracy of 84.7%. To identify the most favorable characteristics of the Ride (Car or Bus), the recursive elimination of features algorithm was applied. By identifying the factors contributing to users’ experience, the service providers will be able to rectify those factors to increase business. It has been determined that the weather can make or break the user experience. This model will be used to quantify and map intrinsic and extrinsic sentiments of the people and their interactions with locality, socio-economic conditions, climate, and environment. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

14 pages, 519 KiB  
Article
Incorporating a Machine Learning Model into a Web-Based Administrative Decision Support Tool for Predicting Workplace Absenteeism
by Gopal Nath, Yawei Wang, Austin Coursey, Krishna K. Saha, Srikanth Prabhu and Saptarshi Sengupta
Information 2022, 13(7), 320; https://doi.org/10.3390/info13070320 - 30 Jun 2022
Cited by 3 | Viewed by 3629
Abstract
Productivity losses caused by absenteeism at work cost U.S. employers billions of dollars each year. In addition, employers typically spend a considerable amount of time managing employees who perform poorly. By using predictive analytics and machine learning algorithms, organizations can make better decisions, thereby increasing organizational productivity, reducing costs, and improving efficiency. Thus, in this paper we propose hybrid optimization methods in order to find the most parsimonious model for absenteeism classification. We utilized data from a Brazilian courier company. In order to categorize absenteeism classes, we preprocessed the data, selected the attributes via multiple methods, balanced the dataset using the synthetic minority over-sampling method, and then employed four methods of machine learning classification: Support Vector Machine (SVM), Multinomial Logistic Regression (MLR), Artificial Neural Network (ANN), and Random Forest (RF). We selected the best model based on several validation scores, and compared its performance against the existing model. Furthermore, project managers may lack experience in machine learning, or may not have the time to spend developing machine learning algorithms. Thus, we propose a web-based interactive tool supported by cognitive analytics management (CAM) theory. The web-based decision tool enables managers to make more informed decisions, and can be used without any prior knowledge of machine learning. Understanding absenteeism patterns can assist managers in revising policies or creating new arrangements to reduce absences in the workplace, financial losses, and the probability of economic insolvency. Full article
(This article belongs to the Special Issue Predictive Analytics and Data Science)

17 pages, 713 KiB  
Article
A Forward-Looking Approach to Compare Ranking Methods for Sports
by Peter Juma Ochieng, András London and Miklós Krész
Information 2022, 13(5), 232; https://doi.org/10.3390/info13050232 - 3 May 2022
Cited by 6 | Viewed by 3624
Abstract
In this paper, we provide a simple forward-looking approach to compare rating methods with respect to their stability over time. Given a rating vector of entities involved in the comparison and a ranking indicated by the rating, the stability of the methods is measured by the change in rating vector and ranks of the entities over time from a forward-looking perspective. We investigate various linear algebraic rating methods and use the Euclidean distance and Kendall tau rank correlation to measure their stability in rating and ranking, respectively. The investigations are based on both rolling and expanding window approaches. We apply the methodology to sports as a widely known ranking and rating environment. The results suggest that PageRank and Massey rating methods provide better rating and ranking stability than simple methods, such as winning percentage, and more advanced ones, such as Colley’s least squares and Keener’s eigenvector-based method. Finally, a simple way to examine the potential predictive power of the rating methods is also provided.
(This article belongs to the Special Issue Predictive Analytics and Data Science)
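The two stability measures named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the four-team rating vectors are hypothetical, and the Kendall tau shown is the simple tau-a variant, which assumes no tied ratings.

```python
from itertools import combinations
from math import sqrt


def euclidean_distance(r1, r2):
    """Euclidean distance between two rating vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))


def kendall_tau(r1, r2):
    """Kendall tau-a between the rankings induced by two rating vectors.

    Assumption: no tied ratings. A tied pair would count as neither
    concordant nor discordant.
    """
    n = len(r1)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0:
            score += 1  # pair ordered the same way in both windows
        elif s < 0:
            score -= 1  # pair ordered differently
    return score / (n * (n - 1) / 2)


# Hypothetical ratings of four teams in two consecutive time windows.
ratings_t0 = [0.9, 0.6, 0.4, 0.1]
ratings_t1 = [0.8, 0.3, 0.5, 0.2]

print("rating change (Euclidean):", euclidean_distance(ratings_t0, ratings_t1))
print("rank agreement (Kendall tau):", kendall_tau(ratings_t0, ratings_t1))
```

A small distance and a tau close to 1 would indicate a rating method whose output changes little from one window to the next.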

19 pages, 688 KiB  
Article
Prediction of Rainfall in Australia Using Machine Learning
by Antonio Sarasa-Cabezuelo
Information 2022, 13(4), 163; https://doi.org/10.3390/info13040163 - 24 Mar 2022
Cited by 16 | Viewed by 8330
Abstract
Meteorology is an area in which a large amount of data is generated and in which predicting events is difficult because of the high number of variables on which they depend. In general, probabilistic models are used for this purpose; they offer predictions with a margin of error, so in many cases the predictions are not very good. Given these conditions, machine learning algorithms can help improve predictions. This article describes an exploratory study of the use of machine learning to make predictions about the phenomenon of rain. To do this, a dataset describing the rainfall measurements gathered in the main cities of Australia in the last 10 years was taken as an example, and some of the main machine learning algorithms were applied (k-nearest neighbors, decision tree, random forest, and neural networks). The results show that the best model is based on neural networks.
(This article belongs to the Special Issue Predictive Analytics and Data Science)

16 pages, 993 KiB  
Article
Automatic Eligibility of Sellers in an Online Marketplace: A Case Study of Amazon Algorithm
by Álvaro Gómez-Losada, Gualberto Asencio-Cortés and Néstor Duch-Brown
Information 2022, 13(2), 44; https://doi.org/10.3390/info13020044 - 19 Jan 2022
Cited by 5 | Viewed by 5172
Abstract
Purchase processes on Amazon Marketplace begin at the Buy Box, which represents the buy click process through which numerous sellers compete. This study aimed to estimate empirically the relevant seller characteristics that Amazon could consider featuring in the Buy Box. To that end, 22 product categories from Italy’s Amazon web page were studied over a ten-month period, and the sellers were analyzed through their products featured in the Buy Box. Two different experiments were proposed and the results were analyzed using four classification algorithms (a neural network, random forest, support vector machine, and C5.0 decision trees) and a rule-based classification. The first experiment aimed to characterize sellers unspecifically by predicting their change at the Buy Box. The second one aimed to predict which seller would be featured in it. Both experiments revealed that the customer experience and the dynamics of the sellers’ prices were important features of the Buy Box. Additionally, we proposed a set of default features that Amazon could consider when no information about sellers was available. We also proposed the possible existence of a relationship or composition among important features that could be used for sellers to be featured in the Buy Box.
(This article belongs to the Special Issue Predictive Analytics and Data Science)

15 pages, 3515 KiB  
Article
Dual-Hybrid Modeling for Option Pricing of CSI 300ETF
by Kejing Zhao, Jinliang Zhang and Qing Liu
Information 2022, 13(1), 36; https://doi.org/10.3390/info13010036 - 13 Jan 2022
Cited by 7 | Viewed by 3000
Abstract
The reasonable pricing of options can effectively help investors avoid risks and obtain benefits, which plays a very important role in the stability of the financial market. The traditional single option pricing model often fails to meet the ideal expectations due to its limited conditions. Combining an economic model with a deep learning model to establish a hybrid model provides a new method to improve the prediction accuracy of the pricing model. The experimental analysis uses real historical data from about 10,000 sets of CSI 300 ETF options from January to December 2020. To address the prediction problem of CSI 300 ETF option pricing, and based on random forest feature importance, the Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) deep learning model is combined with two parametric models: the Heston stochastic volatility model and the CIR stochastic interest rate model. A dual-hybrid pricing model for the call option and the put option of the CSI 300 ETF is established. The dual-hybrid model and the reference model are integrated with ridge regression to further improve the forecasting effect. The results show that the dual-hybrid pricing model proposed in this paper has high accuracy, and its prediction accuracy is tens to hundreds of times higher than that of the reference model; moreover, the MSE can be as low as 0.0003. The article provides an alternative method for the pricing of financial derivatives.
(This article belongs to the Special Issue Predictive Analytics and Data Science)

15 pages, 1196 KiB  
Article
Predicting COVID-19 Cases in South Korea with All K-Edited Nearest Neighbors Noise Filter and Machine Learning Techniques
by David Opeoluwa Oyewola, Emmanuel Gbenga Dada, Sanjay Misra and Robertas Damaševičius
Information 2021, 12(12), 528; https://doi.org/10.3390/info12120528 - 19 Dec 2021
Cited by 10 | Viewed by 3447
Abstract
The application of machine learning techniques to the epidemiology of COVID-19 is a necessary measure that can be exploited to curtail the further spread of this disease. Conventional techniques used to determine the epidemiology of COVID-19 are slow and costly, and data are scarce. We investigate the effects of noise filters on the performance of machine learning algorithms on the COVID-19 epidemiology dataset. Noise filter algorithms are used to remove noise from the datasets utilized in this study. We applied nine machine learning techniques to classify the epidemiology of COVID-19: bagging, boosting, support vector machine, bidirectional long short-term memory, decision tree, naïve Bayes, k-nearest neighbor, random forest, and multinomial logistic regression. Data from patients who contracted coronavirus disease between 23 January 2020 and 24 June 2020 were collected from the Kaggle database. Noisy and filtered data were used in our experiments. As a result of denoising, the machine learning models produced strong results for the prediction of COVID-19 cases in South Korea. For isolated cases, after noise filtering, the machine learning techniques achieved an accuracy between 98% and 100%. The results indicate that filtering noise from the dataset can improve the accuracy of COVID-19 case prediction algorithms.
(This article belongs to the Special Issue Predictive Analytics and Data Science)

27 pages, 5291 KiB  
Article
Private Car O-D Flow Estimation Based on Automated Vehicle Monitoring Data: Theoretical Issues and Empirical Evidence
by Antonio Comi, Alexander Rossolov, Antonio Polimeni and Agostino Nuzzolo
Information 2021, 12(12), 493; https://doi.org/10.3390/info12120493 - 26 Nov 2021
Cited by 22 | Viewed by 4266
Abstract
Data on the daily activity of private cars form the basis of many studies in the field of transportation engineering. In the past, in order to obtain such data, a large number of collection techniques based on travel diaries and driver interviews were used. Telematics applied to vehicles and to a broad range of economic activities has opened up new opportunities for transportation engineers, allowing a significant increase in the volume and detail level of data collected. One of the options for obtaining information on the daily activity of private cars now consists of processing data from automated vehicle monitoring (AVM). Therefore, in this context, and in order to explore the opportunity offered by telematics, this paper presents a methodology for obtaining origin–destination flows through basic information extracted from AVM/floating car data (FCD). The benefits of such a procedure are then evaluated through its implementation in a real test case, i.e., the Veneto region in northern Italy, where full-day AVM/FCD data were available with about 30,000 vehicles surveyed and more than 388,000 trips identified. The goodness of the proposed methodology for O-D flow estimation is subsequently validated through assignment to the road network and comparison with traffic count data. Taking into account aspects of vehicle-sampling observations, this paper also points out issues related to sample representativeness, both in terms of daily activities and spatial coverage. A preliminary descriptive analysis of the O-D flows was carried out, and the analysis of the revealed trip patterns is presented.
(This article belongs to the Special Issue Predictive Analytics and Data Science)