Prediction and Visualisation of SICONV Project Profiles Using Machine Learning
Abstract
1. Introduction
2. Materials and Methods
2.1. The Database SICONV
2.2. Project Representation
2.2.1. Variable Selection
2.2.2. Variable Definition
2.2.3. Variable Characterisation
2.2.4. Variable Transformation
2.2.5. Feature Correlation Analysis
- The more parallel a vector is to a PC axis, the more it contributes to that PC.
- The longer the vector, the more of that variable's variability is captured by the two displayed principal components.
- Small angles between vectors indicate strong positive correlation, right angles indicate no correlation, and angles close to 180° indicate strong negative correlation (a minimal R sketch producing such a biplot follows this list).
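The correlation circles in this work were produced with the factoextra package cited in the references. The sketch below is a minimal, hedged illustration of that step, using R's built-in iris data as a stand-in for the project variables rather than the SICONV data.

```r
library(factoextra)  # fviz_pca_var(); see Kassambara & Mundt in the references

# PCA on standardised numeric variables (iris used as a stand-in data set)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Correlation circle: vector direction and length show each variable's
# contribution to PC1/PC2; angles between vectors reflect their correlations
fviz_pca_var(pca, col.var = "contrib", repel = TRUE)
```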
2.2.6. Feature Vector Definition
2.3. Project Visualisation, Grouping and Labelling
2.4. Definition of the Project Profile
2.5. Modelling
- Classification Accuracy ($acc$): $acc$ is defined in Equation (1), in which $n$ is the number of observations, $i$ indexes the $i$-th observation, $w_i$ is the weight and $[\cdot]$ is the Iverson bracket (Equation (2)), which holds a value of one if the target ($t_i$) equals the response ($r_i$) and zero otherwise. In this study, $w$ is normalised such that the sum of its values is one, and has the same value for all observations and assessment metrics.
- Balanced Accuracy ($bacc$): $bacc$ computes the weighted balanced accuracy, suitable for imbalanced data sets. Equation (3) defines $bacc$, in which $t_i$ is the class of the $i$-th observation and $t_j$ is the class of the $j$-th observation, of a multi-class problem with $k$ classes.
- Classification Error ($ce$): $ce$ compares true observed labels with predicted labels in multi-class classification tasks and equals $1 - acc$. Equation (5) defines $ce$.
- Log Loss ($logloss$): $logloss$ compares true observed labels with predicted probabilities in multi-class classification tasks. It is defined in Equation (6), in which $p_i$ is the predicted probability for the true class of observation $i$.
- Multi-class Brier Score ($mbrier$): $mbrier$ compares true observed labels with predicted probabilities in multi-class classification tasks. $mbrier$ is defined in Equation (7), in which $y_{ij}$ is 1 if observation $i$ possesses the true label $j$, and 0 otherwise.
- Recall ($recall$): also called true positive rate or sensitivity. It is defined in Equation (8), in which $TP$ is the number of true positives and $FN$ the number of false negatives.
- Specificity ($spec$): also called true negative rate. It is defined in Equation (9), in which $TN$ is the number of true negatives and $FP$ the number of false positives.
- F-beta Score ($F_{\beta}$): $F_{\beta}$ compares true observed labels with predicted labels in binary classification tasks and is defined in Equation (10). In this study $\beta = 1$, yielding the measure known as the $F_1$ score. A hedged R sketch computing all of these metrics follows this list.
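All of the above metrics are available as functions in the mlr3measures package cited in the references. The following sketch, using randomly generated labels rather than the SICONV data, illustrates how each can be computed.

```r
library(mlr3measures)

set.seed(1)
classes  <- letters[1:3]
truth    <- factor(sample(classes, 100, replace = TRUE), levels = classes)
response <- factor(sample(classes, 100, replace = TRUE), levels = classes)

# Predicted probability matrix: one column per class, rows summing to one
prob <- matrix(runif(300), ncol = 3)
prob <- prob / rowSums(prob)
colnames(prob) <- classes

acc(truth, response)   # classification accuracy (Equation (1))
bacc(truth, response)  # balanced accuracy (Equation (3))
ce(truth, response)    # classification error = 1 - acc (Equation (5))
logloss(truth, prob)   # log loss (Equation (6))
mbrier(truth, prob)    # multi-class Brier score (Equation (7))

# Binary metrics (Equations (8)-(10)) on a two-class example
tb <- factor(sample(c("pos", "neg"), 100, replace = TRUE), levels = c("pos", "neg"))
rb <- factor(sample(c("pos", "neg"), 100, replace = TRUE), levels = c("pos", "neg"))
recall(tb, rb, positive = "pos")
specificity(tb, rb, positive = "pos")
fbeta(tb, rb, positive = "pos", beta = 1)  # the F1 score used in this study
```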
3. Results and Discussion
4. Study Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
SICONV | Management System of Agreements and Transfer Contracts |
BPMN | Business Process Model and Notation |
MAPA | Ministry of Agriculture, Livestock and Food Supply |
PCA | Principal Component Analysis |
t-SNE | t-Distributed Stochastic Neighbor Embedding |
MLR3 | Machine Learning in R |
R$ | the Brazilian real, which is the official currency of Brazil |
Appendix A
Classifier | Description |
---|---|
classif.AdaBoostM1 | AdaBoost generates a set of hypotheses and combines them using weighted majority voting. Hypotheses are generated by training a weak classifier on iteratively updated training data, which increases the likelihood that misclassified cases will be included in the training data of the next classifier; training data for successive classifiers therefore focus on harder-to-classify cases [48]. |
classif.C50 | The decision tree divides a dataset into smaller subsets. Each leaf node represents a decision and each branch represents a value; classification starts at the root node and proceeds according to the features. Algorithm C5.0 is derived from algorithm C4.5, which in turn is derived from algorithm ID3. C5.0 has advantages over ID3 and C4.5: speed, better memory usage and smaller decision trees [49]. |
classif.catboost | Categorical Boosting (catBoost)—CatBoost handles categorical features using binary decision trees as base predictors and different permutations for different steps of gradient boosting. CatBoost is an implementation of gradient boosting. CatBoost is indicated for studies involving categorical and heterogeneous data [50]. |
classif.ctree | Conditional Inferences Trees (cTREE)—The CTree method recursively partitions the data by performing a univariate division on the dependent variable, just like traditional decision trees. However, the CTree method uses a classical statistical significance test, selecting a division point based on the minimum p-value of all independence tests, between the response variable and each explanatory variable [51]. |
classif.cv_glmnet | Cross-validated Generalized Linear Models with Elastic Net Regularization (cv_glmnet)—GLMNET fits generalized linear and similar models by penalized maximum likelihood, using cyclical coordinate descent (optionally with the Lasso penalty): it successively optimizes the objective function over each parameter with the others fixed, repeating the cycle until convergence. It can be used for linear, logistic and multinomial regression. One of GLMNET's main tuning parameters is the regularization penalty, for which GLMNET computes a set of values called the regularization path, specified by the argument lambda. cv_glmnet uses cross-validation to optimize the lambda value [52]. |
classif.featureless | Featureless—The featureless classifier ignores all features and relies on distances between objects: objects are classified according to their distances to a subset of training objects, and the distances obtained are combined with classifiers that can be linear or non-linear [53]. |
classif.gbm | Gradient Boosting Machines (GBM)—GBM is used to solve regression and data classification problems. The learning model is based on consecutively fitting new models to provide a more accurate estimate of the response variable. GBM analyses the predictors and chooses the strongest predictors. GBM performance can be improved by using an additional classifier [54,55]. |
classif.glmnet | Generalized Linear Models with Elastic Net Regularization (glmnet)—Similar to cv_glmnet, but uses a cost-sensitive measure to optimize the lambda value. |
classif.IBk | Instance-Based Learning with parameter k (IBk)—IBk is a k-nearest-neighbour classifier in the lazy classifier category. The parameter k determines the number of neighbours that are analysed, and the outcome is decided by majority vote; the value of k can be selected by cross-validation. Given a test instance, the algorithm searches the training dataset for the closest instances, most commonly by Euclidean distance, and assigns their majority class to the test sample [56]. |
classif.JRip | Repeated Incremental Pruning to Produce Error Reduction (RIPPER)—JRip is an optimized version of IREP. JRip uses propositional rules that can be executed to classify elements; the rules are created through sequential algorithms. The JRip algorithm creates rules for each dataset, considering the features of the evaluated class; subsequently the next class will also be evaluated and measured according to the previous class. This cycle is repeated until the last class is evaluated [57,58]. |
classif.kknn | k-Nearest-Neighbour (kknn)—The kNN classifier assigns unlabelled observations to the most similar labelled class. When a data point is provided, kNN searches the training dataset for the k samples nearest to that point, commonly using the Euclidean distance. The parameter k determines how many neighbours are considered [59,60]. |
classif.lda | Linear Discriminant Analysis (LDA)—LDA is used to distinguish two distinct classes through the linear combination of features. This combination can be used for classification or dimension reduction. Through this method it is possible to project a multidimensional data set in only one dimension, resulting in a single feature [61,62]. |
classif.liblinear | Library for Large-Scale Linear (liblinear)—LibLINEAR is an open source library and uses a coordinate descent algorithm. LibLINEAR supports logistic regression (LR) and linear support vector machines (SVM). LibLINEAR can classify data that can be linearly separated via a hyperplane [63,64]. |
classif.lightgbm | Light Gradient boosting algorithm (LightGBM)— This algorithm is based on decision tree algorithms. LightGBM is the implementation of Gradient Boosting with Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [65]. |
classif.LMT | Logistic Model Trees (LMT)—A logistic model tree (LMT) combines a decision tree and linear logistic regression. LMT uses a tree-growing approach called LogitBoost to refine logistic regression models along their corresponding paths. Additive logistic regression modelling by LogitBoost provides a way to build a leaf model from a partial linear model, which is inherited from its ancestor nodes as the tree grows [66]. |
classif.naive_bayes | Naive Bayesian (naive_bayes)—Naive Bayesian uses the construction of a Bayesian probabilistic model (based on Bayes’ theorem). The Naive Bayesian classifier only needs the mean and variance parameters of the variables and assumes that the variables are independent [67]. |
classif.nnet | Neural Network (nnet)—A neural network (NN) is a mathematical representation of networks of neurons, with input signals generating output that is constrained to propagate in the forward direction. Training such a feedforward network requires an optimization algorithm; among the several available, back propagation (BP) is the most commonly used [67,68,69]. |
classif.OneR | One Rule (OneR)—OneR creates a rule for each predictor in the dataset, then selects the rule with the lowest misclassification rate as the "one rule". To create a rule for a predictor, it builds a frequency table of that predictor against the class: for each predictor value it counts how often each class appears, finds the most frequent class and makes the rule assign that class to that predictor value; it then calculates the total error of each predictor's rules and chooses the predictor with the smallest total error [70,71,72]. |
classif.PART | Partial Decision Tree (PART)—The PART classification method uses the divide-and-conquer approach. The algorithm produces classification rules by building a partial C4.5 decision tree in each iteration, using the J4.8 classifier technique. PART creates rules recursively: the best leaf is turned into a rule, the instances covered by that rule are deleted, and the process repeats until no instances remain [73,74,75,76]. |
classif.randomForest | Random Forest (randomForest)—The random forest algorithm combines individual tree predictors, building multiple decision trees in the training stage. For each data point, each tree casts a vote for one class, and the forest predicts the class that obtains the majority of votes [60,77]. |
classif.ranger | Random Classification Forest (ranger)—Fast implementation of Random Forest method [78]. |
classif.rfsrc | Random Forest for Survival, Regression, and Classification (rfsrc)—Random Forest SRC is an implementation of Random Forest for application in survival, regression and classification settings (Ishwaran et al., 2008). |
classif.rpart | Recursive Partitioning (rpart)—RPART is the implementation of Classification and Regression Trees (CART). This is a method that uses a recursive partitioning regression tree. The algorithm creates a large tree and then prunes the tree to a size that has the lowest cross-validation error estimate by evaluating the values of a cost-complexity parameter [79]. |
classif.svm | Support Vector Machine (SVM)—A support vector machine tries to classify data by a separating hyperplane. In this form, SVM separates the input data into two classes, trying to maximize the distance between the optimal hyperplane and the nearest training pattern [60]. |
classif.xgboost | eXtreme Gradient Boosting classification (xgboost)—XGBoost is an ensemble of classification or regression trees based on Gradient Boosting, which iteratively combines the predictions of multiple trees. The process is repeated until the accuracy or error is satisfactory; after each iteration, the model learns and adds new information to the ensemble. The final model is a linear combination of hundreds to thousands of trees, forming a regression model in which each term is a tree [80,81]. |
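The learner IDs in the table above follow the mlr3 naming scheme. As a hedged illustration rather than the authors' exact pipeline, the sketch below benchmarks a few of the listed learners with 10-fold cross-validation; `projects`, a data frame of feature vectors with a categorical `profile` target column, is a hypothetical stand-in for the SICONV data.

```r
library(mlr3)
library(mlr3learners)  # provides classif.kknn, classif.ranger, classif.svm, ...

# `projects` is a hypothetical data frame of project features with a
# categorical `profile` column (the cluster label to be predicted)
task <- as_task_classif(projects, target = "profile")

learners <- lrns(c("classif.rpart", "classif.kknn", "classif.ranger"),
                 predict_type = "prob")  # probabilities needed for logloss/mbrier

design <- benchmark_grid(task, learners, rsmp("cv", folds = 10))
bmr    <- benchmark(design)

# Aggregate the five multi-class metrics reported in Section 2.5
bmr$aggregate(msrs(c("classif.acc", "classif.bacc", "classif.ce",
                     "classif.logloss", "classif.mbrier")))
```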
Appendix B
References
- de Lacerda, L.F.T. Analysis of the Quality of Accountability of Private Foundations in the Federal District to the Public Ministry of the Federal District and Territories. Bachelor Dissertation, Universidade de Brasília, Brasília, Brazil, 2017. Available online: https://bdm.unb.br/handle/10483/18432 (accessed on 1 August 2022).
- Portulhak, H.; Vaz, P.V.C.; Delay, A.J.; Pacheco, V. The quality of third sector organizations’ accountability: An analysis from its relationship with the behavior of individual donors. Enfoque Reflexão Contábil 2017, 36, 45–63. [Google Scholar] [CrossRef] [Green Version]
- Trussel, J.M.; Parsons, L.M. Financial reporting factors affecting donations to charitable organizations. Adv. Account. 2007, 23, 263–285. [Google Scholar] [CrossRef]
- Rana, T.; Steccolini, I.; Bracci, E.; Mihret, D.G. Performance auditing in the public sector: A systematic literature review and future research avenues. Financ. Account. Manag. 2021, 38, 337–359. [Google Scholar] [CrossRef]
- Otia, J.E.; Bracci, E. Digital transformation and the public sector auditing: The SAI’s perspective. Financ. Account. Manag. 2022, 38, 252–280. [Google Scholar] [CrossRef]
- Sun, T.; Sales, L.J. Predicting public procurement irregularity: An application of neural networks. J. Emerg. Technol. Account. 2018, 15, 141–154. [Google Scholar] [CrossRef]
- Zhang, X. Construction and simulation of financial audit model based on convolutional neural network. Comput. Intell. Neurosci. 2021, 2021, 1–11. [Google Scholar] [CrossRef]
- Mongwe, W.T.; Mbuvha, R.; Marwala, T. Bayesian inference of local government audit outcomes. PLoS ONE 2021, 16, e0261245. [Google Scholar] [CrossRef]
- Khan, A.T.; Cao, X.; Li, S.; Katsikis, V.N.; Brajevic, I.; Stanimirovic, P.S. Fraud detection in publicly traded u.s firms using beetle antennae search: A machine learning approach. Expert Syst. Appl. 2022, 191, 116148. [Google Scholar] [CrossRef]
- Jiang, Y.; Jones, S. Corporate distress prediction in China: A machine learning approach. Account. Financ. 2018, 58, 1063–1109. [Google Scholar] [CrossRef] [Green Version]
- Abbasi, A.; Albrecht, C.; Vance, A.; Hansen, J. MetaFraud: A meta-learning framework for detecting financial fraud. MIS Q. 2012, 36, 1293–1327. [Google Scholar] [CrossRef] [Green Version]
- Hamal, S.; Senvar, O. Comparing performances and effectiveness of machine learning classifiers in detecting financial accounting fraud for Turkish SMEs. Int. J. Comput. Intell. Syst. 2021, 14, 769–782. [Google Scholar] [CrossRef]
- Bertomeu, J.; Cheynel, E.; Floyd, E.; Pan, W. Using machine learning to detect misstatements. Rev. Account. Stud. 2020, 26, 468–519. [Google Scholar] [CrossRef]
- Bao, Y.; Ke, B.; Li, B.; Yu, Y.J.; Zhang, J. Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. J. Account. Res. 2020, 58, 199–235. [Google Scholar] [CrossRef]
- Zhang, X. Application of data mining and machine learning in management accounting information system. J. Appl. Sci. Eng. 2021, 24, 813–820. [Google Scholar] [CrossRef]
- Song, X.P.; Hu, Z.H.; Du, J.G.; Sheng, Z.H. Application of machine learning methods to risk assessment of financial statement fraud: Evidence from China. J. Forecast. 2014, 33, 611–626. [Google Scholar] [CrossRef]
- Papík, M.; Papíková, L. Detecting accounting fraud in companies reporting under US GAAP through data mining. Int. J. Account. Inf. Syst. 2022, 45, 100559. [Google Scholar] [CrossRef]
- Chen, Y.; Zhang, S. Accounting information disclosure and financial crisis beforehand warning based on the artificial neural network. Wirel. Commun. Mob. Comput. 2022, 2022, 1–11. [Google Scholar] [CrossRef]
- Li, Q. Parallel bookkeeping path of accounting in government accounting system based on deep neural network. J. Electr. Comput. Eng. 2022, 2022, 1–10. [Google Scholar] [CrossRef]
- Liu, L. Evaluation method of financial accounting quality in colleges and universities based on dynamic neuron model. Comput. Intell. Neurosci. 2022, 2022, 1–11. [Google Scholar] [CrossRef]
- Cecchini, M.; Aytug, H.; Koehler, G.J.; Pathak, P. Detecting management fraud in public companies. Manag. Sci. 2010, 56, 1146–1160. [Google Scholar] [CrossRef]
- Kuzey, C.; Uyar, A.; Delen, D. An investigation of the factors influencing cost system functionality using decision trees, support vector machines and logistic regression. Int. J. Account. Inf. Manag. 2019, 27, 27–55. [Google Scholar] [CrossRef]
- de Laat, P.B. Algorithmic decision-making based on machine learning from big data: Can transparency restore accountability? Philos. Technol. 2017, 31, 525–541. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Bakumenko, A.; Elragal, A. Detecting anomalies in financial data using machine learning algorithms. Systems 2022, 10, 130. [Google Scholar] [CrossRef]
- Zou, J.; Fu, X.; Yang, J.; Gong, C. Measuring bank systemic risk in china: A network model analysis. Systems 2022, 10, 14. [Google Scholar] [CrossRef]
- Nonaka, T.H. Estudo comparativo dos manuais de prestação de contas do governo federal. Bachelor Dissertation, Universidade de Brasília, Brasília, Brazil, 2013. Available online: http://bdm.unb.br/handle/10483/12574 (accessed on 1 August 2022).
- Pereira, J.R.T.; Filho, J.B.C. Rejeições de prestação de contas de governos municipais: O que está acontecendo? Contabilidade Gestão e Governança 2012, 15, 33–43. Available online: https://www.revistacgg.org/index.php/contabil/article/view/393 (accessed on 1 August 2022).
- Lima, M.B. Organizações não governamentais (ONGs): Um estudo sobre a transparência na elaboração da prestação de contas e dos relatórios financeiros emitidos nas organizações não governamentais do DF. Bachelor Dissertation, Universidade de Brasília, Brasília, Brazil, 2011. [Google Scholar] [CrossRef]
- e Barros, F.H.G.; Neto, M.S. Inserindo a dimensão de resultados nas prestações de contas. Revista do Tribunal de Contas da União 2010, 119, 65–70. Available online: https://revista.tcu.gov.br/ojs/index.php/RTCU/article/view/201/194 (accessed on 3 December 2022).
- Tomaskova, H.; Kopecky, M. Specialization of business process model and notation applications in medicine—A review. Data 2020, 5, 99. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. [Google Scholar]
- Moutinho, J.d.A.; Rabechini Junior, R. Adherence between project management and the management system of agreements and transfer contracts (SICONV). Syst. Manag. 2017, 12, 83–97. [Google Scholar] [CrossRef] [Green Version]
- Borchers, H.W. Pracma: Practical Numerical Math Functions, R Package Version 2.3.8; 2022. Available online: https://cran.r-project.org/web/packages/pracma/index.html (accessed on 1 August 2022).
- Abdi, H.; Williams, L.J. Principal component analysis. WIREs Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
- Kassambara, A.; Mundt, F. factoextra: Extract and Visualize the Results of Multivariate Data Analyses, R Package Version 1.0.7; 2020. Available online: https://cran.r-project.org/web/packages/factoextra/readme/README.html (accessed on 1 August 2022).
- Hartmann, K.; Krois, J. E-Learning Project SOGA: Statistics and Geospatial Data Analysis; Department of Earth Sciences, Freie Universitaet Berlin: Berlin, Germany, 2018; Available online: https://www.geo.fu-berlin.de/en/v/soga/index.html (accessed on 12 September 2022).
- van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Krijthe, J.H. Rtsne: T-Distributed Stochastic Neighbor Embedding Using Barnes-Hut Implementation, R Package Version 0.16; 2015. Available online: https://cran.r-project.org/web/packages/Rtsne/index.html (accessed on 1 August 2022).
- Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. Appl. Stat. 1979, 28, 100. [Google Scholar] [CrossRef]
- Lang, M.; Binder, M.; Richter, J.; Schratz, P.; Pfisterer, F.; Coors, S.; Au, Q.; Casalicchio, G.; Kotthoff, L.; Bischl, B. mlr3: A modern object-oriented machine learning framework in R. J. Open Source Softw. 2019, 4, 1903. [Google Scholar] [CrossRef]
- Sonabend, R.; Schratz, P.; Fischer, S. mlr3extralearners: Extra Learners for mlr3, R Package Version 0.5.48; 2022. Available online: https://github.com/mlr-org/mlr3extralearners (accessed on 1 August 2022).
- Lang, M. mlr3measures: Performance Measures for ‘mlr3’, R Package Version 0.5.0; 2022. Available online: https://cran.r-project.org/web/packages/mlr3measures/index.html (accessed on 1 August 2022).
- Peng, W.; Ye, Z.S.; Chen, N. Bayesian deep-learning-based health prognostics toward prognostics uncertainty. IEEE Trans. Ind. Electron. 2020, 67, 2283–2293. [Google Scholar] [CrossRef]
- Zhang, L.; Xu, A.; An, L.; Li, M. Bayesian inference of system reliability for multicomponent stress-strength model under Marshall-Olkin Weibull distribution. Systems 2022, 10, 196. [Google Scholar] [CrossRef]
- Bandyopadhyay, S.; Maulik, U. An evolutionary technique based on k-means algorithm for optimal clustering in RN. Inf. Sci. 2002, 146, 221–237. [Google Scholar] [CrossRef]
- Ikotun, A.M.; Almutari, M.S.; Ezugwu, A.E. K-means-based nature-inspired metaheuristic algorithms for automatic data clustering problems: Recent advances and future directions. Appl. Sci. 2021, 11, 11246. [Google Scholar] [CrossRef]
- Ikotun, A.M.; Ezugwu, A.E. Boosting k-means clustering with symbiotic organisms search for automatic clustering problems. PLoS ONE 2022, 17, 1–33. [Google Scholar] [CrossRef]
- Polikar, R. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 2006, 6, 21–45. [Google Scholar] [CrossRef]
- Pandya, R.; Pandya, J.; Dholakiya, K.P.; Amreli, I. C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. Int. J. Comput. Appl. 2015, 117, 975–8887. [Google Scholar]
- Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data 2020, 7. [Google Scholar] [CrossRef]
- Maloney, K.O.; Weller, D.E.; Russell, M.J.; Hothorn, T. Classifying the biological condition of small streams: An example using benthic macroinvertebrates. J. N. Am. Benthol. Soc. 2009, 28, 869–884. [Google Scholar] [CrossRef]
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Duin, R.P.W.; de Ridder, D.; Tax, D.M.J. Featureless pattern classification. Kybernetika 1998, 34, 399–404. Available online: https://www.kybernetika.cz/content/1998/4/399/paper.pdf (accessed on 3 December 2022).
- Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobotics 2013, 7, 21. [Google Scholar] [CrossRef] [PubMed]
- Shrivastav, L.K.; Jha, S.K. A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of COVID-19 in India. Appl. Intell. 2021, 51, 2727–2739. [Google Scholar] [CrossRef] [PubMed]
- Kalmegh, S.R. Effective classification of Indian News using Lazy classifier IB1And IBk from weka. Int. J. Inf. Comput. Sci. 2019, 6, 160–168. [Google Scholar]
- Gupta, A.; Mohammad, A.; Syed, A.; Halgamuge, M.N. A comparative study of classification algorithms using data mining: Crime and accidents in denver city the USA. Int. J. Adv. Comput. Sci. Appl. 2016, 7. [Google Scholar] [CrossRef] [Green Version]
- Tarun, I.M.; Gerardo, B.D.; Tanguilig III, B.T. Generating licensure examination performance models using PART and JRip classifiers: A data mining application in education. Int. J. Comput. Commun. Eng. 2014, 3, 202–207. [Google Scholar] [CrossRef] [Green Version]
- Zhang, Z. Introduction to machine learning: K-nearest neighbors. Ann. Transl. Med. 2016, 4, 218. [Google Scholar] [CrossRef] [Green Version]
- Calil, B.C.; Da Cunha, D.V.; Vieira, M.F.; De Oliveira Andrade, A.; Furtado, D.A.; Bellomo Junior, D.P.; Pereira, A.A. Identification of arthropathy and myopathy of the temporomandibular syndrome by biomechanical facial features. Biomed. Eng. Online 2020, 19. [Google Scholar] [CrossRef] [Green Version]
- Bhardwaj, A.; Gupta, A.; Jain, P.; Rani, A.; Yadav, J. Classification of human emotions from EEG signals using SVM and LDA classifiers. In Proceedings of the 2nd International Conference on Signal Processing and Integrated Networks, SPIN 2015, Noida, India, 19–20 February 2015; pp. 180–185. [Google Scholar] [CrossRef]
- Cavalheiro, G.L.; Almeida, M.F.S.; Pereira, A.A.; Andrade, A.O. Study of age-related changes in postural control during quiet standing through linear discriminant analysis. Biomed. Eng. Online 2009, 8, 35. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Al-Zubaidi, A.; Rabee, F.; Al-Sulttani, A.H.; Al-Zubaidi, E.A. Classification of large-scale datasets of Landsat-8 satellite image based on LIBLINEAR library. Al-Salam J. Eng. Technol. 2022, 1, 9–17. [Google Scholar] [CrossRef]
- Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874. [Google Scholar]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 3149–3157. [Google Scholar] [CrossRef]
- Lee, S.; Jun, C.H. Fast incremental learning of logistic model tree using least angle regression. Expert Syst. Appl. 2018, 97, 137–145. [Google Scholar] [CrossRef]
- Park, Y. A comparison of neural net classifiers and linear tree classifiers: Their similarities and differences. Pattern Recognit. 1994, 27, 1493–1503. [Google Scholar] [CrossRef]
- Behera, S.S.; Chaudhuri, S.B.; Chattopadhyay, S. A comparative study on neural net classifier optimizations. Int. J. Adv. Eng. Technol. 2012, 179, 179–187. Available online: https://www.ijaet.org/media/0006/20I10-IJAET0907113-A-COMPARATIVE-STUDY.pdf (accessed on 3 December 2022).
- Behera, S.S.; Chattopadhyay, S. A comparative study of back propagation and simulated annealing algorithms for neural net classifier optimization. Procedia Eng. 2012, 38, 448–455. [Google Scholar] [CrossRef] [Green Version]
- Jamjoom, M. The pertinent single-attribute-based classifier for small datasets classification. Int. J. Electr. Comput. Eng. (IJECE) 2020, 10, 3227–3234. [Google Scholar] [CrossRef]
- Iyer, K.B.P.; Pavithra, K.; Nivetha, D.; Kumudhavarshini, K. Predictive analytics in diabetes using oner classification algorithm. IJCA Proc. Int. Conf. Commun. Comput. Inf. Technol. 2018, 14–19. Available online: https://research.ijcaonline.org/icccmit2017/number1/icccmit201718.pdf (accessed on 3 December 2022).
- Alam, F.; Pachauri, S. Comparative study of j48, Naive Bayes and One-R classification technique for credit card fraud detection using WEKA. Adv. Comput. Sci. Technol. 2017, 10, 1731–1743. Available online: https://www.ripublication.com/acst17/acstv10n6_19.pdf (accessed on 3 December 2022).
- Frank, E.; Witten, I.H. Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998; pp. 144–151. [Google Scholar] [CrossRef]
- Makalesi, A.; Kaya, Y.; Tekin, R. Comparison of discretization methods for classifier decision trees and decision rules on medical data sets. Eur. J. Sci. Technol. 2022, 275–281. [Google Scholar] [CrossRef]
- Nasa, C.; Cse Deptt, S.A.P. Evaluation of different classification techniques for WEB data. Int. J. Comput. Appl. 2012, 52, 975–8887. [Google Scholar] [CrossRef]
- Porwik, P.; Doroz, R.; Orczyk, T. The k-NN classifier and self-adaptive Hotelling data reduction technique in handwritten signatures recognition. Pattern Anal. Appl. 2015, 18, 983–1001. [Google Scholar] [CrossRef] [Green Version]
- Caie, P.D.; Dimitriou, N.; Arandjelović, O. Chapter 8 - Precision medicine in digital pathology via image analysis and machine learning. In Artificial Intelligence and Deep Learning in Pathology; Cohen, S., Ed.; Elsevier: Amsterdam, The Netherlands, 2021; pp. 149–173. [Google Scholar] [CrossRef]
- Amar, D.; Izraeli, S.; Shamir, R. Utilizing somatic mutation data from numerous studies for cancer research: Proof of concept and applications. Oncogene 2017, 36, 3375–3383. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Loh, W.Y. Fifty years of classification and regression trees. Int. Stat. Rev. 2014, 82, 329–348. [Google Scholar] [CrossRef]
- Carmona, P.; Dwekat, A.; Mardawi, Z. No more black boxes! Explaining the predictions of a machine learning XGBoost classifier algorithm in business failure. Res. Int. Bus. Financ. 2022, 61, 101649. [Google Scholar] [CrossRef]
- Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of XGBoost. Artif. Intell. Rev. 2019, 54, 1937–1967. [Google Scholar] [CrossRef]
Study | Purpose | Type of Irregularity | Main Method | Maximum Accuracy |
---|---|---|---|---|
Mongwe et al. (2021) [8] | Fraud detection | Fairness irregularities | Bayesian logistic regression | 75.3% |
Khan et al. (2022) [9] | Fraud detection | Fairness irregularities | Beetle Antennae Search (BAS) | 84.9% |
Jiang and Jones (2018) [10] | Financial distress detection | Stability | Gradient Boosting Model (TreeNet) | 94.9% |
Zhang (2021) [7] | Financial audit | Procedural irregularities | Convolutional Neural Network | 93.4% |
Abbasi et al. (2012) [11] | Fraud detection | Fairness irregularities | Meta-learning | 80% |
Hamal and Senvar (2021) [12] | Fraud detection | Fairness irregularities | Random Forest | 93.7% |
Bertomeu et al. (2020) [13] | Misstatements | Financial data | Random Under-Sampling Boost (RUSBoost) | 76.3% |
Bao et al. (2020) [14] | Fraud detection | Fairness irregularities | Random Under-Sampling Boost (RUSBoost) | 71.7%
Zhang (2021) [15] | Management accounting information | Decision-making | Artificial Neural Network (ANN) | 100% |
Song et al. (2014) [16] | Fraud detection | Fairness irregularities | Ensemble of classifiers | 84.5% |
Papík and Papíkova (2022) [17] | Fraud detection | Fairness irregularities | Neural Network (NN) | 90.8% |
Chen and Zhang (2022) [18] | Financial crisis | Irregular accounting information | Artificial Neural Network (ANN) | 90.0% |
Li (2022) [19] | Parallel bookkeeping | Connection of Financial Accounting and Budget Accounting | Deep Neural Network | 87.7% |
Liu (2022) [20] | Financial Accounting Quality | Financial quality indicators | Dynamic Neuron Model | 98% |
Mongwe et al. (2021) [8] | Financial audit | Fraud and weak corporate governance | Bayesian logistic regression with automatic relevance determination (BLR-ARD) | 73% |
Cecchini et al. (2010) [21] | Fraud detection | Fairness irregularities | Support vector machines using the financial kernel (SVM-FK) | 87.8% |
Kuzey et al. (2019) [22] | Factors influencing cost system functionality | Cost data management process | Decision tree algorithm C5.0 (DT-C5.0) | 91.5% |
Variable | Definition | Accountability Factor |
---|---|---|
Planned period | Period (in days) planned for the execution of the project | Regularity and Predictability |
Executed period | Period (in days) effectively used for the project execution | Efficiency |
Total amount | Total amount of the agreement (R$), considering the government contribution and the counterpart of the applicant | Regularity and Predictability |
Government contribution | Government contribution amount (R$) | Regularity and Predictability |
Returned amount | Amount returned (R$) at the end of the agreement | Efficiency |
Additive terms | Number of additive terms | Efficiency |
Extensions | Number of extensions | Efficiency |
Type | Statistic | Planned period (days) | Executed period (days) | Total amount (R$) | Government contribution (R$) | Returned amount (R$) | Additive terms (quantity) | Extensions (quantity) |
---|---|---|---|---|---|---|---|---|
Range | min | 14 | 24 | 15,306 | 15,000 | 0.1 | 0.1 | 0.1 |
 | max | 1,968 | 4,019 | 72,482,484 | 69,527,434 | 6,115,575 | 22 | 6 |
 | Q1 (25%) | 377 | 565 | 127,909.2 | 107,250.0 | 0.10 | 1.0 | 0.1 |
 | Q3 (75%) | 730 | 1,066 | 315,000.0 | 292,500.0 | 412.75 | 2.0 | 0.1 |
Centre | mean | 599.189 | 865.338 | 330,704.1 | 293,912.6 | 3,173.833 | 1.839 | 0.264 |
 | median | 547 | 735 | 205,000 | 193,621.5 | 8 | 1 | 0.1 |
Spread | sd | 270.908 | 445.233 | 1,091,178 | 971,204.6 | 60,592.85 | 1.9 | 0.465 |
 | IQR | 353 | 501 | 187,090.8 | 185,250 | 412.65 | 1 | 0 |
 | mad | 266.868 | 312.829 | 130,295.3 | 138,803.2 | 11.713 | 1.334 | 0 |
Cluster | Planned period (days) | Executed period (days) | Total amount (R$) | Government contribution (R$) | Returned amount (R$) | Additive terms (quantity) | Extensions (quantity) |
---|---|---|---|---|---|---|---|
1 | 366 | 395 | 180,000 | 146,250 | 0.1 | 0.1 | 1 |
2 | 716 | 718 | 202,000 | 195,000 | 227 | 0.1 | 0.1 |
3 | 546 | 749 | 333,000 | 292,500 | 753 | 2 | 0.1 |
4 | 669 | 730 | 153,000 | 146,250 | 0.1 | 1 | 0.1 |
5 | 534 | 707 | 118,000 | 100,000 | 241 | 2 | 0.1 |
6 | 940 | 1044 | 210,000 | 195,000 | 289 | 1 | 0.1 |
7 | 576 | 670 | 209,580 | 195,000 | 0.1 | 0.1 | 0.1 |
8 | 453 | 748 | 425,000 | 390,000 | 0.1 | 2 | 0.1 |
9 | 444 | 1002 | 242,406 | 200,000 | 2 | 2 | 1 |
10 | 456 | 815 | 144,837 | 117,000 | 0.1 | 2 | 0.1 |
[Table: profile of each cluster (1–10), listing, per cluster, the project variables that fall into the low, medium and high categories; the variable symbols in the table cells were not recoverable from the source.]
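The cluster medoids above result from grouping the two-dimensional t-SNE embedding of the projects with k-means, using the Rtsne package and R's kmeans cited in the references. A minimal sketch of that visualisation-and-grouping step, again with iris as a stand-in data set (and 3 clusters instead of the paper's 10), might look as follows.

```r
library(Rtsne)  # Barnes-Hut t-SNE; see Krijthe in the references

set.seed(42)
X <- unique(scale(iris[, 1:4]))  # Rtsne requires distinct rows

# Embed into two dimensions, then group the embedding with k-means
emb <- Rtsne(X, dims = 2, perplexity = 30)$Y
km  <- kmeans(emb, centers = 3, nstart = 25)

plot(emb, col = km$cluster, pch = 19,
     xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")
```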
Variable | 0% | 25% | 50% | 75% | 100% |
---|---|---|---|---|---|
Planned period (days) | 366.00 | 453.75 | 540.00 | 645.75 | 940.00 |
Executed period (days) | 395.00 | 709.75 | 739.00 | 798.50 | 1,043.50 |
Total amount (R$) | 118,000.0 | 159,750.0 | 205,790.0 | 234,304.1 | 425,000.0 |
Government contribution (R$) | 100,000 | 146,250 | 195,000 | 198,750 | 390,000 |
Returned amount (R$) | 0.10 | 0.10 | 1.05 | 237.50 | 753.00 |
Additive terms (quantity) | 0.100 | 0.325 | 1.500 | 2.000 | 2.000 |
Extensions (quantity) | 0.1 | 0.1 | 0.1 | 0.1 | 1.0 |
Variable | Low | Medium | High |
---|---|---|---|
Planned period (days) | ≤453.75 | >453.75 and ≤645.75 | >645.75 |
Executed period (days) | ≤709.75 | >709.75 and ≤798.50 | >798.50 |
Total amount (R$) | ≤159,750 | >159,750 and ≤234,304.1 | >234,304.1 |
Government contribution (R$) | ≤146,250 | >146,250 and ≤198,750 | >198,750 |
Returned amount (R$) | | | >237.50 |
Additive terms (quantity) | ≤0.325 | >0.325 and ≤2.00 | >2.00 |
Extensions (quantity) | | | >1.0 |
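The Low/Medium/High thresholds above coincide with the 25% and 75% quantiles of the cluster-medoid values in the preceding table, so the categorisation can be reproduced with a quartile-based cut. The sketch below shows this for the planned execution period, under the assumption that quartile cutting is indeed the rule used.

```r
# Quartile-based categorisation of one medoid variable (planned period, days);
# the values are the ten cluster medoids from the medoid table above.
planned <- c(366, 716, 546, 669, 534, 940, 576, 453, 444, 456)

q <- quantile(planned, probs = c(0.25, 0.75))  # 453.75 and 645.75, as in the table
cut(planned, breaks = c(-Inf, q[1], q[2], Inf),
    labels = c("Low", "Medium", "High"))
```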
Classifier | acc | bacc | ce | logloss | mbrier |
---|---|---|---|---|---|
classif.AdaBoostM1 | 0.245 ± 0.006 | 0.199 ± 0.001 | 0.755 ± 0.006 | 1.774 ± 0.007 | 0.817 ± 0.002 |
classif.C50 | 0.97 ± 0.003 | 0.971 ± 0.002 | 0.03 ± 0.003 | 0.145 ± 0.013 | 0.054 ± 0.005 |
classif.catboost | 0.1 ± 0.006 | 0.1 ± 0.006 | 0.9 ± 0.006 | 34.539 ± 0.001 | 1 ± 0.001 |
classif.ctree | 0.96 ± 0.004 | 0.961 ± 0.004 | 0.04 ± 0.004 | 0.349 ± 0.043 | 0.063 ± 0.005 |
classif.cv_glmnet | 0.871 ± 0.007 | 0.866 ± 0.007 | 0.129 ± 0.007 | 0.417 ± 0.017 | 0.192 ± 0.007 |
classif.featureless | 0.138 ± 0.005 | 0.100 ± 0.001 | 0.862 ± 0.005 | 29.785 ± 0.169 | 1.725 ± 0.01 |
classif.gbm | 0.918 ± 0.005 | 0.917 ± 0.005 | 0.082 ± 0.005 | 0.35 ± 0.011 | 0.146 ± 0.005 |
classif.glmnet | 0.843 ± 0.005 | 0.83 ± 0.005 | 0.157 ± 0.005 | 0.659 ± 0.01 | 0.294 ± 0.004 |
classif.IBk | 0.978 ± 0.002 | 0.979 ± 0.002 | 0.022 ± 0.002 | 0.211 ± 0.022 | 0.043 ± 0.004 |
classif.JRip | 0.961 ± 0.004 | 0.962 ± 0.004 | 0.039 ± 0.004 | 0.32 ± 0.032 | 0.072 ± 0.008 |
classif.kknn | 0.991 ± 0.002 | 0.991 ± 0.002 | 0.009 ± 0.002 | 0.035 ± 0.01 | 0.016 ± 0.002 |
classif.lda | 0.849 ± 0.005 | 0.838 ± 0.006 | 0.151 ± 0.005 | 0.873 ± 0.053 | 0.256 ± 0.006 |
classif.liblinear | 0.831 ± 0.006 | 0.819 ± 0.006 | 0.169 ± 0.006 | 0.698 ± 0.01 | 0.316 ± 0.004 |
classif.lightgbm | 0.139 ± 0.125 | 0.134 ± 0.121 | 0.861 ± 0.125 | 12.976 ± 2.004 | 1.71 ± 0.25 |
classif.LMT | 0.96 ± 0.003 | 0.961 ± 0.003 | 0.04 ± 0.003 | 0.193 ± 0.026 | 0.062 ± 0.005 |
classif.naive_bayes | 0.849 ± 0.005 | 0.842 ± 0.005 | 0.151 ± 0.005 | 0.843 ± 0.042 | 0.227 ± 0.009 |
classif.nnet | 0.593 ± 0.144 | 0.581 ± 0.151 | 0.407 ± 0.144 | 1.064 ± 0.359 | 0.512 ± 0.133 |
classif.OneR | 0.243 ± 0.005 | 0.217 ± 0.004 | 0.757 ± 0.005 | 26.146 ± 0.16 | 1.514 ± 0.009 |
classif.PART | 0.968 ± 0.004 | 0.969 ± 0.004 | 0.032 ± 0.004 | 0.562 ± 0.092 | 0.059 ± 0.007 |
classif.randomForest | 0.976 ± 0.003 | 0.976 ± 0.002 | 0.024 ± 0.003 | 0.12 ± 0.005 | 0.047 ± 0.003 |
classif.ranger | 0.974 ± 0.003 | 0.974 ± 0.003 | 0.026 ± 0.003 | 0.141 ± 0.004 | 0.052 ± 0.002 |
classif.rfsrc | 0.977 ± 0.002 | 0.977 ± 0.002 | 0.023 ± 0.002 | 0.074 ± 0.006 | 0.036 ± 0.003 |
classif.rpart | 0.871 ± 0.006 | 0.864 ± 0.006 | 0.129 ± 0.006 | 0.466 ± 0.017 | 0.223 ± 0.009 |
classif.svm | 0.961 ± 0.002 | 0.962 ± 0.002 | 0.039 ± 0.002 | 0.108 ± 0.006 | 0.057 ± 0.003 |
classif.xgboost | 0.928 ± 0.007 | 0.927 ± 0.007 | 0.072 ± 0.007 | 1.178 ± 0.005 | 0.522 ± 0.002 |
Class | recall (×10⁻²) | spec (×10⁻²) | fbeta (×10⁻²) |
---|---|---|---|
1 | 99.39 | 99.42 | 99.95 |
2 | 99.56 | 99.61 | 99.95 |
3 | 99.04 | 99.30 | 99.80 |
4 | 99.15 | 99.04 | 99.93 |
5 | 98.63 | 98.86 | 99.80 |
6 | 98.65 | 98.33 | 99.89 |
7 | 99.27 | 99.05 | 99.96 |
8 | 99.20 | 99.08 | 99.92 |
9 | 99.91 | 99.99 | 99.98 |
10 | 98.65 | 98.60 | 99.83 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).