EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms
Abstract
1. Introduction
- The framework provides a simple, object-oriented, and parallelized Python implementation of three common data preprocessing tasks with nature-inspired optimization methods.
- The framework is compatible with the well-established Python machine learning and data analysis libraries scikit-learn, imbalanced-learn, pandas, and NumPy. It is also compatible with the nature-inspired optimization framework NiaPy.
- The framework provides an easily extendable API that can be customized with any scikit-learn compatible prediction model or NiaPy compatible optimization method.
- The framework provides data preprocessing for regression supervised learning problems.
2. Problem Formulation
2.1. Nature-Inspired Preprocessing Optimization
2.1.1. Solution Encoding for Preprocessing Tasks
2.1.2. Self-Adaptive Solutions
- In data weighting, the maximum weight can be set;
- in data sampling, the mapping from the continuous value to the appearance count can be set; and
- in feature selection, the mapping from the continuous value to the presence or absence of the feature can be set (a decoding sketch follows this list).
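Since all three tasks encode candidate solutions as vectors of continuous values, decoding differs only in the mapping applied to each gene. The following is a minimal illustrative sketch of such decodings; the threshold, bins, and maximum weight below are assumptions for illustration, not the framework's exact defaults.

```python
import numpy as np

def decode_feature_selection(genotype, threshold=0.5):
    # A gene above the threshold marks its feature as present (assumed cut-off).
    return genotype > threshold

def decode_sampling(genotype, bins=(0.25, 0.5, 0.75)):
    # Bin each gene into an appearance count: 0, 1, 2, or 3 copies of the instance.
    return np.digitize(genotype, bins)

def decode_weighting(genotype, max_weight=2.0):
    # Scale each gene from [0, 1] to an instance weight in [0, max_weight].
    return genotype * max_weight

genotype = np.random.rand(8)  # one candidate solution in the continuous search space
print(decode_feature_selection(genotype))
print(decode_sampling(genotype))
print(decode_weighting(genotype))
```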
2.1.3. Optimization Process for Preprocessing Tasks
2.2. Nature-Inspired Algorithms
2.3. Computation Complexity
3. EvoPreprocess Framework
- data_sampling
- data_weighting
- feature_selection
- The main task class, which is used to run the data sampling, feature selection, or data weighting task; and
- the standard benchmark class, which is the default class used in the evaluation of the task and can be replaced or extended by a custom evaluation class.
3.1. Task Classes
- random_seed ensures reproducibility. The default value is the current system time in milliseconds.
- evaluator is the machine learning supervised approach used for the evaluation of the preprocessed data. A scikit-learn compatible classifier or regressor should be used here; see the description of the benchmark classes for more details.
- optimizer is the optimization method used to obtain the preprocessed data. A NiaPy compatible optimization method is expected here, one that implements the run function and uses the provided evaluation benchmark function. The default optimization method is the genetic algorithm.
- n_folds is the number of folds for the cross-validation split into the training and the validation sets. To prevent data leakage, optimized data samplings are evaluated on hold-out validation sets. The default is 2 folds.
- n_runs is the number of independent runs of the optimizer on each fold. If the optimizer is deterministic, one run should be sufficient; otherwise, more runs are recommended. The default is 10 runs.
- benchmark is the evaluation class containing the function that returns the quality of the data sampling. A custom benchmark class should be used here if the preprocessing objective differs from the singular objective of optimizing the error rate and F-score (for classification) or the mean squared error (for regression).
- n_jobs is the number of optimizers run in parallel. The default is the number of CPU cores.
- In data sampling, the best-performing instance occurrences in every fold are aggregated with the mode.
- In data weighting, the best weights in every fold are averaged.
- In feature selection, the best-performing selected features in every fold are aggregated with the mode (a minimal illustration follows this list).
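A minimal illustration of these aggregations, assuming the best solution of each of three folds is stacked row-wise (the values are made up for the example):

```python
import numpy as np
from scipy import stats

# Best solutions from three folds (illustrative values).
fold_occurrences = np.array([[0, 2, 1],
                             [0, 1, 1],
                             [1, 2, 1]])    # data sampling: appearance counts
fold_weights = np.array([[0.9, 1.2],
                         [1.1, 1.0],
                         [1.0, 0.8]])       # data weighting: instance weights

# Mode across folds for data sampling (and for feature selection masks) ...
occurrences = stats.mode(fold_occurrences, axis=0).mode.ravel()  # [0 2 1]
# ... and the average across folds for data weighting.
weights = fold_weights.mean(axis=0)                              # [1. 1.]
```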
- _run is a private static function that creates and runs the optimizer with the provided evaluation benchmark function. Multiple calls of this function can be run in parallel.
- _reduce is a private static function that aggregates (reduces) the results of individual runs on multiple folds into one final preprocessing result.
Algorithm 1: Base procedure for running preprocessing optimization.
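The listing of Algorithm 1 is not reproduced here; the following Python sketch outlines the same base procedure under stated assumptions. The helper optimize_once is a hypothetical stand-in for one _run call and is assumed to return a (fitness, solution) pair; the mode aggregation stands in for _reduce.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold

def run_preprocessing_optimization(X, y, optimize_once, n_folds=2, n_runs=10):
    # optimize_once(X, y, train_idx, valid_idx) stands in for one optimizer
    # run (_run) and must return a (fitness, solution) pair.
    fold_best = []
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
        # n_runs independent optimizer runs per fold; these can run in parallel.
        runs = [optimize_once(X, y, train_idx, valid_idx) for _ in range(n_runs)]
        fold_best.append(min(runs, key=lambda run: run[0])[1])  # best solution per fold
    # _reduce: aggregate the fold winners, e.g., with the mode for data sampling.
    return stats.mode(np.array(fold_best), axis=0).mode.ravel()
```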
- _get_support_mask is a private function that checks whether features have already been selected. It overrides the corresponding function of the _BaseFilter class from scikit-learn.
- select is a score function that receives the data X and the corresponding target values y and selects features from X. It is provided as the scoring function for the _BaseFilter class, so EvoFeatureSelection can be used in scikit-learn pipelines as a feature selection step. It returns X_FS, which is derived from X with some features potentially removed (see the mask sketch below).
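How the returned X_FS relates to X can be illustrated with a boolean support mask over the columns of X; the mask values below are hypothetical:

```python
import numpy as np

X = np.random.rand(5, 4)                             # data with four features
support_mask = np.array([True, False, True, False])  # hypothetical result of select

# _get_support_mask exposes such a mask to scikit-learn; applying it keeps
# only the selected columns and yields X_FS.
X_FS = X[:, support_mask]
print(X_FS.shape)  # (5, 2)
```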
3.2. Evaluation of Solutions with Benchmark Classes
- X is the data to be preprocessed.
- y contains the target values for each data instance in X.
- train_indices is an array of indices of the instances in X that should be used to train the evaluator.
- valid_indices is an array of indices of the instances in X that should be used for validation.
- random_seed is the random seed for the evaluator and ensures reproducibility. The default value is 1234, which prevents evaluators initialized at different times from producing different results.
- evaluator is the machine learning supervised approach used for the evaluation of the results of the task. A scikit-learn compatible classifier or regressor should be used here: the evaluator's fit function is used to construct the model, and its predict function is used to get the predictions. If the target of the data set is numerical (a regression task), a regression method should be provided; otherwise, a classification method is needed. The default evaluator is None, which sets the evaluator to linear regression if the data set target is a number, or to a Gaussian naive Bayes classifier if the target is nominal (an illustrative re-implementation of this rule follows this list).
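The described default choice of the evaluator can be re-implemented in a few lines; this is an illustrative sketch of the rule, not the framework's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

def default_evaluator(y):
    # Numerical (continuous) target -> regression; nominal target -> classification.
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.floating):
        return LinearRegression()
    return GaussianNB()

print(default_evaluator([0.3, 1.7, 2.4]))  # LinearRegression()
print(default_evaluator([0, 1, 1, 0]))     # GaussianNB()
```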
- SamplingBenchmark for data sampling,
- WeightingBenchmark for data weighting, and
- FeatureSelectionBenchmark for feature selection.
4. Examples of Use
- Python 3.6 must be installed,
- The NumPy package,
- The scikit-learn package, at least version 0.19.0,
- The imbalanced-learn package, at least version 0.3.1,
- The NiaPy package, at least version 2.0.0rc5, and
- The EvoPreprocess package can be accessed at https://github.com/karakatic/EvoPreprocess.
4.1. Data Sampling
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_sampling import EvoSampling
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape, len(dataset.target))
(569, 30) 569
>>> X_resampled, y_resampled = EvoSampling().fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(341, 30) 341
```
```python
>>> from sklearn.datasets import load_boston
>>> from sklearn.tree import DecisionTreeRegressor
>>> import NiaPy.algorithms.basic as nia
>>> from EvoPreprocess.data_sampling import EvoSampling, SamplingBenchmark
>>>
>>> dataset = load_boston()
>>> print(dataset.data.shape, len(dataset.target))
(506, 13) 506
>>> X_resampled, y_resampled = EvoSampling(
...     evaluator=DecisionTreeRegressor(),
...     optimizer=nia.EvolutionStrategyMpL,
...     n_folds=5,
...     n_runs=5,
...     n_jobs=4,
...     benchmark=SamplingBenchmark
... ).fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(703, 13) 703
```
4.2. Data Weighting
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_weighting import EvoWeighting
>>>
>>> dataset = load_breast_cancer()
>>> instance_weights = EvoWeighting().reweight(dataset.data, dataset.target)
>>> print(instance_weights)
[1.568983893273244 1.2899430717992133 ... 0.7248390003761751]
```
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from EvoPreprocess.data_weighting import EvoWeighting
>>>
>>> random_seed = 1234
>>> dataset = load_breast_cancer()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     dataset.data, dataset.target,
...     test_size=0.33, random_state=random_seed)
>>> cls = DecisionTreeClassifier(random_state=random_seed)
>>> cls.fit(X_train, y_train)
>>>
>>> print(X_train.shape, accuracy_score(y_test, cls.predict(X_test)), sep=': ')
(381, 30): 0.8936170212765957
>>> instance_weights = EvoWeighting(random_seed=random_seed).reweight(X_train, y_train)
>>> cls.fit(X_train, y_train, sample_weight=instance_weights)
>>> print(X_train.shape, accuracy_score(y_test, cls.predict(X_test)), sep=': ')
(381, 30): 0.9042553191489362
```
4.3. Feature Selection
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape)
(569, 30)
>>> X_new = EvoFeatureSelection().fit_transform(dataset.data, dataset.target)
>>> print(X_new.shape)
(569, 17)
```
```python
>>> from sklearn.datasets import load_boston
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeRegressor
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> random_seed = 654
>>> dataset = load_boston()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     dataset.data, dataset.target,
...     test_size=0.33, random_state=random_seed)
>>> model = DecisionTreeRegressor(random_state=random_seed)
>>> model.fit(X_train, y_train)
>>> print(X_train.shape, mean_squared_error(y_test, model.predict(X_test)), sep=': ')
(339, 13): 24.475748502994012
>>> evo = EvoFeatureSelection(evaluator=model, random_seed=random_seed)
>>> X_train_new = evo.fit_transform(X_train, y_train)
>>>
>>> model.fit(X_train_new, y_train)
>>> X_test_new = evo.transform(X_test)
>>> print(X_train_new.shape, mean_squared_error(y_test, model.predict(X_test_new)), sep=': ')
(339, 6): 18.03443113772455
```
4.4. Compatibility and Extendability
```python
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.tree import DecisionTreeRegressor
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> random_seed = 987
>>> dataset = load_boston()
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
...     dataset.data, dataset.target,
...     test_size=0.33, random_state=random_seed)
>>> model = DecisionTreeRegressor(random_state=random_seed)
>>> model.fit(X_train, y_train)
>>> print(mean_squared_error(y_test, model.predict(X_test)))
20.227544910179642
>>> pipeline = Pipeline(steps=[
...     ('feature_selection', EvoFeatureSelection(
...         evaluator=LinearRegression(),
...         n_folds=4,
...         n_runs=8,
...         random_seed=random_seed)),
...     ('regressor', DecisionTreeRegressor(random_state=random_seed))
... ])
>>> pipeline.fit(X_train, y_train)
>>> print(mean_squared_error(y_test, pipeline.predict(X_test)))
19.073532934131734
```
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from imblearn.pipeline import Pipeline
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>> from EvoPreprocess.data_sampling import EvoSampling
>>>
>>> random_seed = 1111
>>> dataset = load_breast_cancer()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     dataset.data, dataset.target,
...     test_size=0.33, random_state=random_seed)
>>>
>>> cls = DecisionTreeClassifier(random_state=random_seed)
>>> cls.fit(X_train, y_train)
>>> print(accuracy_score(y_test, cls.predict(X_test)))
0.8829787234042553
>>> pipeline = Pipeline(steps=[
...     ('feature_selection', EvoFeatureSelection(n_folds=10, random_seed=random_seed)),
...     ('data_sampling', EvoSampling(n_folds=10, random_seed=random_seed)),
...     ('classifier', DecisionTreeClassifier(random_state=random_seed))
... ])
>>> pipeline.fit(X_train, y_train)
>>> print(accuracy_score(y_test, pipeline.predict(X_test)))
0.9148936170212766
```
```python
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_sampling import EvoSampling
>>> import NiaPy.algorithms.basic as nia
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape, len(dataset.target))
(569, 30) 569
>>> settings = {'NP': 1000, 'A': 0.5, 'r': 0.5, 'Qmin': 0.0, 'Qmax': 2.0}
>>> X_resampled, y_resampled = EvoSampling(
...     optimizer=nia.BatAlgorithm,
...     optimizer_settings=settings
... ).fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(335, 30) 335
```
```python
>>> import numpy as np
>>> from NiaPy.algorithms import Algorithm
>>> from numpy import apply_along_axis, math
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.utils import safe_indexing
>>> from EvoPreprocess.data_sampling import EvoSampling
>>> from EvoPreprocess.data_sampling.SamplingBenchmark import SamplingBenchmark
>>>
>>> class RandomSearch(Algorithm):
...     Name = ['RandomSearch', 'RS']
...
...     def runIteration(self, task, pop, fpop, xb, fxb, **dparams):
...         # Sample a completely new random population in every iteration.
...         pop = task.Lower + self.Rand.rand(self.NP, task.D) * task.bRange
...         fpop = apply_along_axis(task.eval, 1, pop)
...         return pop, fpop, {}
>>>
>>> class CustomSamplingBenchmark(SamplingBenchmark):
...     # _________________0___1_____2______3_______4___
...     mapping = np.array([0.5, 0.75, 0.875, 0.9375, 1])
...
...     def function(self):
...         def evaluate(D, sol):
...             phenotype = SamplingBenchmark.map_to_phenotype(
...                 CustomSamplingBenchmark.to_phenotype(sol))
...             X_sampled = safe_indexing(self.X_train, phenotype)
...             y_sampled = safe_indexing(self.y_train, phenotype)
...             if X_sampled.shape[0] > 0:
...                 cls = self.evaluator.fit(X_sampled, y_sampled)
...                 y_predicted = cls.predict(self.X_valid)
...                 quality = accuracy_score(self.y_valid, y_predicted)
...                 size_percentage = len(y_sampled) / len(sol)
...                 return (1 - quality) * size_percentage
...             else:
...                 return math.inf
...         return evaluate
...
...     @staticmethod
...     def to_phenotype(genotype):
...         return np.digitize(genotype[:-5], CustomSamplingBenchmark.mapping)
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape, len(dataset.target))
(569, 30) 569
>>> X_resampled, y_resampled = EvoSampling(
...     optimizer=RandomSearch,
...     benchmark=CustomSamplingBenchmark
... ).fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(311, 30) 311
```
5. Experiments
- 2 folds for the internal cross-validation,
- 10 independent runs of the meta-heuristic algorithm on each fold,
- the genetic algorithm meta-heuristic optimizer from NiaPy with a population size of 200, a crossover probability of 0.8, a mutation probability of 0.5, and 20,000 total evaluations,
- DecisionTreeClassifier for the evaluation of the solutions (denoted as CART),
- a random seed of 1111 for all algorithms, and
- all other settings left at their default values (a configuration sketch follows this list).
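Expressed with the framework's API, such a setup might look as follows. The optimizer_settings keys (NP, Cr, Mr) are NiaPy GeneticAlgorithm parameter names, and mapping the listed probabilities onto them is an assumption here, not taken verbatim from the experiments.

```python
import NiaPy.algorithms.basic as nia
from sklearn.tree import DecisionTreeClassifier
from EvoPreprocess.data_sampling import EvoSampling

# Illustrative configuration mirroring the listed experiment settings; the
# optimizer_settings keys are an assumption about the NiaPy parameter names.
sampler = EvoSampling(
    evaluator=DecisionTreeClassifier(random_state=1111),   # CART
    optimizer=nia.GeneticAlgorithm,
    optimizer_settings={'NP': 200, 'Cr': 0.8, 'Mr': 0.5},  # population, crossover, mutation
    n_folds=2,
    n_runs=10,
    random_seed=1111)
```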
Comparison of Nature-Inspired Algorithms
6. Conclusions
Funding
Conflicts of Interest
| Fold | CART Acc | CART Fsc | EvoPreprocess Acc | EvoPreprocess Fsc |
|---|---|---|---|---|
| 1 | 83.56 | 33.33 | 87.67 | 47.06 |
| 2 | 90.28 | 36.36 | 95.83 | 57.14 |
| 3 | 88.89 | 20.00 | 91.67 | 57.14 |
| 4 | 91.67 | 25.00 | 88.89 | 42.86 |
| 5 | 85.92 | 0.00 | 88.73 | 0.00 |
| Average | 88.06 | 22.94 | 90.56 | 40.84 |
| Median | 88.89 | 25.00 | 88.89 | 47.06 |
| Average rank | 1.8 | 1.8 | 1.2 | 1.0 |
| Fold | CART Acc | CART Fsc | chi2 Acc | chi2 Fsc | ANOVA F Acc | ANOVA F Fsc | Mutual Information Acc | Mutual Information Fsc | Sklearn Genetic Acc | Sklearn Genetic Fsc | EvoPreprocess Acc | EvoPreprocess Fsc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 84.93 | 35.29 | 97.26 | 80.00 | 98.63 | 88.89 | 90.41 | 53.33 | 82.19 | 31.58 | 97.26 | 80.00 |
| 2 | 83.33 | 33.33 | 94.44 | 33.33 | 94.44 | 33.33 | 95.83 | 57.14 | 94.44 | 60.00 | 95.83 | 57.14 |
| 3 | 93.06 | 28.57 | 91.67 | 0.00 | 87.50 | 0.00 | 86.11 | 16.67 | 91.67 | 25.00 | 95.83 | 57.14 |
| 4 | 86.11 | 0.00 | 93.06 | 61.54 | 91.67 | 40.00 | 93.06 | 54.55 | 91.67 | 50.00 | 94.44 | 50.00 |
| 5 | 81.69 | 0.00 | 83.10 | 14.29 | 90.14 | 0.00 | 84.51 | 35.29 | 87.32 | 0.00 | 90.14 | 0.00 |
| Average | 85.82 | 19.44 | 91.91 | 37.83 | 92.48 | 32.44 | 89.98 | 43.40 | 89.46 | 33.32 | 94.70 | 48.86 |
| Median | 84.93 | 28.57 | 93.06 | 33.33 | 91.67 | 33.33 | 90.41 | 53.33 | 90.56 | 32.45 | 95.83 | 57.14 |
| Avg. rank | 5.0 | 4.0 | 3.0 | 2.8 | 2.8 | 3.6 | 3.4 | 2.6 | 3.8 | 3.2 | 1.2 | 2.2 |
| Fold | CART Acc | CART Fsc | Under-sampling (Tomek Links) Acc | Under-sampling (Tomek Links) Fsc | Over-sampling (SMOTE) Acc | Over-sampling (SMOTE) Fsc | Over-under-sampling (SMOTE with Tomek) Acc | Over-under-sampling (SMOTE with Tomek) Fsc | EvoPreprocess Acc | EvoPreprocess Fsc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 84.93 | 35.29 | 84.93 | 35.29 | 94.52 | 66.67 | 94.52 | 66.67 | 84.93 | 0.00 |
| 2 | 83.33 | 33.33 | 83.33 | 33.33 | 97.22 | 75.00 | 97.22 | 75.00 | 94.44 | 33.33 |
| 3 | 93.06 | 28.57 | 93.06 | 28.57 | 88.89 | 33.33 | 88.89 | 33.33 | 94.44 | 33.33 |
| 4 | 86.11 | 0.00 | 86.11 | 0.00 | 93.06 | 54.55 | 93.06 | 54.55 | 87.50 | 40.00 |
| 5 | 81.69 | 0.00 | 81.69 | 0.00 | 92.96 | 61.54 | 92.96 | 61.54 | 94.37 | 0.00 |
| Average | 85.82 | 19.44 | 85.82 | 19.44 | 93.33 | 58.22 | 93.33 | 58.22 | 91.14 | 21.33 |
| Median | 84.93 | 28.57 | 84.93 | 28.57 | 93.06 | 61.54 | 93.06 | 61.54 | 94.37 | 33.33 |
| Avg. rank | 3.4 | 3.4 | 3.4 | 3.4 | 1.8 | 1.0 | 1.8 | 1.0 | 2.2 | 3.0 |
| Task | Algorithm | Acc | Fsc | Computation Time (s) with Four Parallel Runs | Computation Time (s) with No Parallel Runs |
|---|---|---|---|---|---|
| Feature selection | Artificial bee colony | 86.24 | 66.02 | 15.0062 | 42.1964 |
| | Bat algorithm | 85.24 | 64.12 | 15.2828 | 47.6856 |
| | Cuckoo search | 84.49 | 60.80 | 13.8600 | 35.9927 |
| | Differential evolution | 86.52 | 66.98 | 16.2162 | 61.5314 |
| | Evolution strategy | 87.01 | 68.72 | 15.2992 | 50.4924 |
| | Genetic algorithm | 86.51 | 67.50 | 15.0095 | 45.0393 |
| | Harmony search | 85.76 | 65.99 | 15.9159 | 50.2762 |
| | Particle swarm optimization | 85.25 | 63.66 | 14.8623 | 52.1369 |
| Data sampling | Artificial bee colony | 85.50 | 64.27 | 19.4990 | 61.4075 |
| | Bat algorithm | 87.01 | 67.85 | 19.0993 | 58.1645 |
| | Cuckoo search | 85.50 | 63.95 | 18.9420 | 53.1541 |
| | Differential evolution | 86.25 | 65.33 | 23.4585 | 67.7465 |
| | Evolution strategy | 84.00 | 61.56 | 19.2668 | 60.3277 |
| | Genetic algorithm | 85.02 | 62.89 | 24.8603 | 73.1325 |
| | Harmony search | 86.01 | 65.59 | 39.1134 | 105.1702 |
| | Particle swarm optimization | 85.25 | 64.30 | 18.7993 | 52.8433 |
| Data weighting | Artificial bee colony | 87.76 | 69.98 | 16.7115 | 43.9391 |
| | Bat algorithm | 86.25 | 66.47 | 14.7816 | 37.3560 |
| | Cuckoo search | 86.74 | 67.55 | 15.9384 | 42.0680 |
| | Differential evolution | 85.52 | 65.64 | 18.8684 | 51.4843 |
| | Evolution strategy | 86.00 | 67.16 | 15.6373 | 72.5965 |
| | Genetic algorithm | 86.24 | 64.35 | 22.0804 | 62.2260 |
| | Harmony search | 87.01 | 67.04 | 37.2622 | 109.2218 |
| | Particle swarm optimization | 88.00 | 68.76 | 15.5451 | 40.0671 |