A Bootstrap Framework for Aggregating within and between Feature Selection Methods
Abstract
1. Introduction
2. Bootstrap Aggregation Framework
- The Within Aggregation Method (WAM) refers to aggregating importance scores within a single feature selection method.
- The Between Aggregation Method (BAM) refers to aggregating importance scores between different feature selection methods.
2.1. Feature Selection Based on WAM
Algorithm 1: WAM Algorithm
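To make the WAM idea concrete, the following is a minimal Python sketch, assuming bootstrap resampling of the observations and mean aggregation of a single method's importance scores. The scorer (scikit-learn's `mutual_info_classif`, standing in for information gain), `n_boot`, and the use of the mean as aggregator are illustrative choices, not a reproduction of Algorithm 1.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def wam_scores(X, y, scorer=mutual_info_classif, n_boot=50, seed=0):
    """Within Aggregation Method (sketch): average one feature selection
    method's importance scores over bootstrap resamples of the data.
    X: (n, p) NumPy array of features; y: (n,) array of class labels."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros((n_boot, p))
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap sample (with replacement)
        scores[b] = scorer(X[idx], y[idx])          # importance scores on the resample
    return scores.mean(axis=0)                      # aggregated score per feature
```

Features would then be ranked by the aggregated scores, e.g. `np.argsort(-wam_scores(X, y))`.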
2.2. Feature Selection Based on BAM
Algorithm 2: BAM Algorithm
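A companion sketch of the BAM idea, again under stated assumptions: each method's bootstrap-averaged scores are rescaled to [0, 1] before being averaged across methods, and mutual information plus the ANOVA F statistic stand in for the filter methods considered in the paper. The helper names and defaults are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif

def bootstrap_mean_scores(X, y, scorer, n_boot=50, seed=0):
    """Average one method's importance scores over bootstrap resamples (WAM step)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    out = np.zeros((n_boot, p))
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        out[b] = scorer(X[idx], y[idx])
    return out.mean(axis=0)

def rescale01(s):
    """Min-max rescale a score vector to [0, 1] so methods are comparable."""
    s = np.asarray(s, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def bam_scores(X, y, scorers, n_boot=50, seed=0):
    """Between Aggregation Method (sketch): rescale each method's aggregated
    scores to a common range and average them across methods."""
    per_method = [rescale01(bootstrap_mean_scores(X, y, s, n_boot, seed))
                  for s in scorers]
    return np.mean(per_method, axis=0)

# Example: combine two filter methods and rank the features.
# ranking = np.argsort(-bam_scores(X, y, [mutual_info_classif,
#                                         lambda X, y: f_classif(X, y)[0]]))
```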
3. Stability Analysis
- A vector of importance scores.
- A rank vector.
- A subset of features represented by an index vector, where 1 indicates feature presence and 0 indicates feature absence (see the short snippet below for an example of the three representations).
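As a toy illustration of these three representations (the numbers and the cut-off k are arbitrary examples, not taken from the paper):

```python
import numpy as np

scores = np.array([0.12, 0.40, 0.05, 0.33])    # importance scores from one bootstrap run
ranks = np.argsort(np.argsort(-scores)) + 1    # rank vector, 1 = most important -> [3 1 4 2]
k = 2                                          # keep the top-k features
index = np.zeros(scores.size, dtype=int)
index[np.argsort(-scores)[:k]] = 1             # index vector of the top-k features -> [0 1 0 1]
```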
- i. Pearson's correlation coefficient: For the similarity between two importance score vectors $s$ and $s'$ produced by one of the feature selection methods, Pearson's correlation coefficient computes the similarity measure as
  $$ r(s, s') = \frac{\sum_{j=1}^{p}(s_j - \bar{s})(s'_j - \bar{s}')}{\sqrt{\sum_{j=1}^{p}(s_j - \bar{s})^2}\,\sqrt{\sum_{j=1}^{p}(s'_j - \bar{s}')^2}}, $$
  where $p$ is the number of features and $\bar{s}$, $\bar{s}'$ are the means of the two score vectors.
- ii. Spearman's rank correlation coefficient: For the similarity between two rank vectors $r$ and $r'$ produced by one of the feature selection methods, Spearman's rank correlation coefficient measures the similarity between the two rank vectors as
  $$ \rho(r, r') = 1 - \frac{6\sum_{j=1}^{p}(r_j - r'_j)^2}{p(p^2 - 1)}. $$
- iii. Canberra's distance: Another measure used to compare two rank vectors is Canberra's distance [19]. This metric accumulates the absolute differences between the two rank vectors as
  $$ d_C(r, r') = \sum_{j=1}^{p}\frac{|r_j - r'_j|}{r_j + r'_j}. $$
- iv. Jaccard's index: Jaccard's index measures the similarity between two finite sets; it is the size of their intersection divided by the size of their union. Given the index vectors $v$ and $v'$ used to represent the two feature subsets, Jaccard's index is
  $$ J(v, v') = \frac{\sum_{j=1}^{p}\min(v_j, v'_j)}{\sum_{j=1}^{p}\max(v_j, v'_j)}. $$
  A minimal sketch of how these measures can be computed follows this list.
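The sketch below averages the four measures over all pairs of bootstrap runs. It uses SciPy's unnormalized Canberra distance and an assumed top-k cut-off for building the index vectors, so the paper's exact normalization and threshold are not reproduced here.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import canberra

def jaccard_index(a, b):
    """Jaccard similarity between two binary feature-indicator vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return (a & b).sum() / (a | b).sum()

def average_stability(score_runs, k=10):
    """Average pairwise similarity between importance-score vectors obtained
    on different bootstrap samples, using the four measures above (sketch)."""
    S = np.asarray(score_runs)                     # shape: (n_runs, n_features)
    pear, spear, canb, jac = [], [], [], []
    for i, j in combinations(range(len(S)), 2):
        s1, s2 = S[i], S[j]
        r1 = np.argsort(np.argsort(-s1)) + 1       # rank vectors (1 = most important)
        r2 = np.argsort(np.argsort(-s2)) + 1
        v1 = np.zeros(s1.size, dtype=int); v1[np.argsort(-s1)[:k]] = 1
        v2 = np.zeros(s2.size, dtype=int); v2[np.argsort(-s2)[:k]] = 1
        pear.append(pearsonr(s1, s2)[0])
        spear.append(spearmanr(r1, r2)[0])
        canb.append(canberra(r1, r2))              # unnormalized Canberra distance
        jac.append(jaccard_index(v1, v2))
    return {"pearson": np.mean(pear), "spearman": np.mean(spear),
            "canberra": np.mean(canb), "jaccard": np.mean(jac)}
```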
4. Experimental Evaluation
4.1. Experimental Datasets
4.2. Experimental Design
- Logistic regression: A statistical model that uses the logistic (sigmoid) function to model the probability of the occurrence of a class or an event. It is a widely used classification algorithm in machine learning. The objective of logistic regression is to analyze the relationship between the categorical dependent variable and a set of independent variables (features) in order to predict the probability of the target variable. The logistic regression coefficients are usually estimated by maximum likelihood.
- Naive Bayes: A probabilistic classifier based on Bayes' theorem [28]. It assumes that each input feature is independent of the other features given the class. It can be used for both binary and multiclass classification problems. Due to its simplicity, it is a fast machine learning algorithm that scales well to large datasets.
- Random Forest: An ensemble of decision trees in which every tree is trained on a random subsample of the data to provide a class prediction. The subsamples are drawn with replacement from the training dataset. The predictions from all the decision trees are then aggregated to yield the model prediction [29]. Random Forest helps prevent over-fitting, but it can be complex to implement.
- Support Vector Machine (SVM): A supervised learning algorithm in which each observation is plotted as a point in p-dimensional space (with p being the number of features). SVM aims to identify the optimal separating hyperplane, i.e., the hyperplane that maximizes the margin between data points of the different classes [30]. A minimal sketch of this evaluation setup follows the list.
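As a rough illustration of the evaluation setup (not the paper's exact protocol), the sketch below trains the four classifiers on an assumed top-k subset of features and reports test accuracy; the hyperparameters, train/test split, and k are placeholder choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_selected_features(X, y, scores, k=10, seed=0):
    """Train the four classifiers on the top-k features ranked by `scores`
    (e.g. WAM or BAM aggregated importances) and report test accuracy.
    X: (n, p) NumPy array; y: (n,) labels."""
    top_k = np.argsort(-scores)[:k]                 # indices of the k highest-scoring features
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, top_k], y, test_size=0.3, random_state=seed, stratify=y)
    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "naive Bayes": GaussianNB(),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=seed),
        "SVM": SVC(kernel="rbf"),
    }
    return {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```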
4.3. Discussion of the Results
4.3.1. Classification Performance
4.3.2. Stability Analysis Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Appendix A
References
1. Sulieman, H.; Alzaatreh, A. A Supervised Feature Selection Approach Based on Global Sensitivity. Arch. Data Sci. Ser. A (Online First) 2018, 5, 3.
2. Bertolazzi, P.; Felici, G.; Festa, P.; Fiscon, G.; Weitschek, E. Integer programming models for feature selection: New extensions and a randomized solution algorithm. Eur. J. Oper. Res. 2016, 250, 389–399.
3. González-Navarro, F. Review and evaluation of feature selection algorithms in synthetic problems. CoRR 2011, abs/1101.2320.
4. Liu, Y.; Schumann, M. Data mining feature selection for credit scoring models. J. Oper. Res. Soc. 2005, 56, 1099–1108.
5. Lemke, C.; Budka, M.; Gabrys, B. Metalearning: A survey of trends and technologies. Artif. Intell. Rev. 2015, 44, 117–130.
6. Parmezan, A.R.S.; Lee, H.D.; Wu, F.C. Metalearning for choosing feature selection algorithms in data mining: Proposal of a new framework. Expert Syst. Appl. 2017, 75, 1–24.
7. Dietterich, T.G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
8. Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ. Comput. Inf. Sci. 2019.
9. Chatterjee, S. The scale enhanced wild bootstrap method for evaluating climate models using wavelets. Stat. Probab. Lett. 2019, 144, 69–73.
10. Abeel, T.; Helleputte, T.; Van de Peer, Y.; Dupont, P.; Saeys, Y. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 2010, 26, 392–398.
11. Zhou, Q.; Ding, J.; Ning, Y.; Luo, L.; Li, T. Stable feature selection with ensembles of multi-ReliefF. In Proceedings of the 2014 10th International Conference on Natural Computation (ICNC), Xiamen, China, 19–21 August 2014; pp. 742–747.
12. Diren, D.D.; Boran, S.; Selvi, I.H.; Hatipoglu, T. Root cause detection with an ensemble machine learning approach in the multivariate manufacturing process. In Industrial Engineering in the Big Data Era; Springer: New York, NY, USA, 2019; pp. 163–174.
13. Shen, Q.; Diao, R.; Su, P. Feature Selection Ensemble. Turing-100 2012, 10, 289–306.
14. Wald, R.; Khoshgoftaar, T.M.; Dittman, D. Mean aggregation versus robust rank aggregation for ensemble gene selection. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications, Boca Raton, FL, USA, 12–15 December 2012; Volume 1, pp. 63–69.
15. Kolde, R.; Laur, S.; Adler, P.; Vilo, J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 2012, 28, 573–580.
16. Ditzler, G.; Polikar, R.; Rosen, G. A bootstrap based Neyman-Pearson test for identifying variable importance. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 880–886.
17. Goh, W.W.B.; Wong, L. Evaluating feature-selection stability in next-generation proteomics. J. Bioinform. Comput. Biol. 2016, 14, 1650029.
18. Kalousis, A.; Prados, J.; Hilario, M. Stability of feature selection algorithms: A study on high-dimensional spaces. Knowl. Inf. Syst. 2007, 12, 95–116.
19. Jurman, G.; Riccadonna, S.; Visintainer, R.; Furlanello, C. Canberra distance on ranked lists. In Proceedings of the Advances in Ranking NIPS 09 Workshop, Whistler, BC, Canada, 11 December 2009; pp. 22–27.
20. Shen, Z.; Chen, X.; Garibaldi, J.M. A Novel Weighted Combination Method for Feature Selection using Fuzzy Sets. In Proceedings of the 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), New Orleans, LA, USA, 23–26 June 2019; pp. 1–6.
21. Seijo-Pardo, B.; Bolón-Canedo, V.; Alonso-Betanzos, A. On developing an automatic threshold applied to feature selection ensembles. Inf. Fusion 2019, 45, 227–245.
22. Seijo-Pardo, B.; Bolón-Canedo, V.; Alonso-Betanzos, A. Testing different ensemble configurations for feature selection. Neural Process. Lett. 2017, 46, 857–880.
23. Khoshgoftaar, T.M.; Golawala, M.; Van Hulse, J. An empirical study of learning from imbalanced data using random forest. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, 29–31 October 2007; Volume 2, pp. 310–317.
24. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2013, 34, 483–519.
25. Hua, J.; Xiong, Z.; Lowey, J.; Suh, E.; Dougherty, E.R. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 2005, 21, 1509–1515.
26. Sánchez-Maroño, N.; Alonso-Betanzos, A.; Tombilla-Sanromán, M. Filter methods for feature selection: A comparative study. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Birmingham, UK, 16–19 December 2007; pp. 178–187.
27. Wang, J.; Xu, J.; Zhao, C.; Peng, Y.; Wang, H. An ensemble feature selection method for high-dimensional data based on sort aggregation. Syst. Sci. Control Eng. 2019, 7, 32–39.
28. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. arXiv 2013, arXiv:1302.4964.
29. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
30. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27:1–27:27.
31. Sing, T.; Sander, O.; Beerenwinkel, N.; Lengauer, T. ROCR: Visualizing classifier performance in R. Bioinformatics 2005, 21, 3940–3941.
Dataset Name and Source | No. Observations (per Class) | No. Features | No. Classes | Dimensionality (No. Features/No. Observations) |
---|---|---|---|---|
Jasmine | 2984 (1492/1492) | 145 | 2 | 0.048592 |
Spectrometer | 531 (476/55) | 103 | 2 | 0.193974 |
Image | 2000 (1420/580) | 140 | 2 | 0.07 |
Fri | 1000 (564/436) | 101 | 2 | 0.101 |
Scene | 2407 (1976/431) | 295 | 2 | 0.122559 |
Musk | 6598 (5581/1017) | 170 | 2 | 0.025765 |
Philippine | 5832 (2916/2916) | 309 | 2 | 0.052984 |
Ionosphere | 351 (126/225) | 34 | 2 | 0.096866 |
Optdigits | 5620 (572/5048) | 64 | 2 | 0.011388 |
Satellite | 5100 (75/5025) | 37 | 2 | 0.007255 |
Ada | 4147 (1029/3118) | 49 | 2 | 0.011816 |
Splice | 3190 (1535/1655) | 62 | 2 | 0.019436 |
HIVA | 4229 (149/4080) | 1617 | 2 | 0.382359 |
| Dataset | Stability Measure | Information Gain | Symmetrical Uncertainty | MRMR | Chi-Squared | BAM |
|---|---|---|---|---|---|---|
| Jasmine | Average Pearson Correlation | 0.299705 | 0.258270 | 0.902748 | 0.360289 | 0.335806 |
| | Average Spearman Rank Correlation | 0.319009 | 0.378017 | 0.730414 | 0.231101 | 0.334655 |
| | Average Jaccard's Index | 0.254683 | 0.319411 | 0.308712 | 0.286586 | 0.252964 |
| | Average Canberra Distance | 0.278113 | 0.269500 | 0.124180 | 0.337122 | 0.294955 |
| | Average Standard Deviation | 0.744817 | 0.753504 | 0.149429 | 0.747054 | 0.765530 |
| Spectrometer | Average Pearson Correlation | 0.898348 | 0.942554 | 0.698045 | 0.916711 | 0.927582 |
| | Average Spearman Rank Correlation | 0.831602 | 0.837903 | 0.765404 | 0.818745 | 0.833578 |
| | Average Jaccard's Index | 0.759908 | 0.917055 | 0.436905 | 0.851790 | 0.872271 |
| | Average Canberra Distance | 0.153674 | 0.152607 | 0.171177 | 0.186988 | 0.182109 |
| | Average Standard Deviation | 0.308687 | 0.223475 | 0.456242 | 0.281311 | 0.258388 |
| Image | Average Pearson Correlation | 0.768404 | 0.760257 | 0.461867 | 0.784970 | 0.793634 |
| | Average Spearman Rank Correlation | 0.702970 | 0.690212 | 0.541442 | 0.716971 | 0.671209 |
| | Average Jaccard's Index | 0.534739 | 0.514374 | 0.275411 | 0.557456 | 0.560470 |
| | Average Canberra Distance | 0.140367 | 0.142500 | 0.236440 | 0.242640 | 0.254949 |
| | Average Standard Deviation | 0.437907 | 0.462153 | 0.643365 | 0.459268 | 0.434591 |
| Fri | Average Pearson Correlation | 0.969642 | 0.912108 | 0.955966 | 0.818795 | 0.948385 |
| | Average Spearman Rank Correlation | 0.458619 | 0.459933 | 0.301028 | 0.327351 | 0.308725 |
| | Average Jaccard's Index | 0.600791 | 0.600791 | 0.263723 | 0.288660 | 0.280655 |
| | Average Canberra Distance | 0.057812 | 0.057752 | 0.329030 | 0.328557 | 0.330035 |
| | Average Standard Deviation | 0.146339 | 0.268018 | 0.201306 | 0.422568 | 0.218823 |
| Scene | Average Pearson Correlation | 0.898425 | 0.887953 | 0.652895 | 0.933014 | 0.908673 |
| | Average Spearman Rank Correlation | 0.871633 | 0.863620 | 0.705409 | 0.904599 | 0.881263 |
| | Average Jaccard's Index | 0.725580 | 0.718157 | 0.429004 | 0.834032 | 0.761022 |
| | Average Canberra Distance | 0.150622 | 0.156575 | 0.206240 | 0.169175 | 0.182344 |
| | Average Standard Deviation | 0.309285 | 0.328335 | 0.501868 | 0.253048 | 0.296812 |
| Musk | Average Pearson Correlation | 0.953028 | 0.939819 | 0.983622 | 0.972910 | 0.971086 |
| | Average Spearman Rank Correlation | 0.897172 | 0.920754 | 0.978164 | 0.958189 | 0.932817 |
| | Average Jaccard's Index | 0.254683 | 0.319411 | 0.308712 | 0.286586 | 0.252964 |
| | Average Canberra Distance | 0.278113 | 0.269500 | 0.124180 | 0.337122 | 0.294955 |
| | Average Standard Deviation | 0.198588 | 0.230549 | 0.106096 | 0.153881 | 0.164122 |
| Philippine | Average Pearson Correlation | 0.992381 | 0.987185 | 0.949337 | 0.974312 | 0.990140 |
| | Average Spearman Rank Correlation | 0.948322 | 0.945942 | 0.876291 | 0.794429 | 0.826292 |
| | Average Jaccard's Index | 0.907578 | 0.895855 | 0.599476 | 0.882073 | 0.898057 |
| | Average Canberra Distance | 0.036655 | 0.037883 | 0.123565 | 0.216403 | 0.199559 |
| | Average Standard Deviation | 0.065557 | 0.093865 | 0.194033 | 0.133756 | 0.090117 |
| Ionosphere | Average Pearson Correlation | 0.398351 | 0.583203 | 0.803445 | 0.689003 | 0.678480 |
| | Average Spearman Rank Correlation | 0.391580 | 0.583566 | 0.779300 | 0.634363 | 0.621247 |
| | Average Jaccard's Index | 0.322984 | 0.418490 | 0.588096 | 0.511871 | 0.514258 |
| | Average Canberra Distance | 0.284348 | 0.254220 | 0.185660 | 0.245600 | 0.249635 |
| | Average Standard Deviation | 0.731482 | 0.606503 | 0.397275 | 0.546923 | 0.549206 |
| Optdigits | Average Pearson Correlation | 0.974733 | 0.956047 | 0.946192 | 0.978112 | 0.976264 |
| | Average Spearman Rank Correlation | 0.965357 | 0.959320 | 0.913443 | 0.968890 | 0.967125 |
| | Average Jaccard's Index | 0.776935 | 0.687190 | 0.621800 | 0.699440 | 0.740367 |
| | Average Canberra Distance | 0.087271 | 0.094572 | 0.112731 | 0.077847 | 0.090535 |
| | Average Standard Deviation | 0.150498 | 0.196308 | 0.188786 | 0.141531 | 0.146916 |
| Satellite | Average Pearson Correlation | 0.962102 | 0.735536 | 0.735555 | 0.962324 | 0.932846 |
| | Average Spearman Rank Correlation | 0.913703 | 0.737037 | 0.886279 | 0.941141 | 0.912465 |
| | Average Jaccard's Index | 0.889171 | 0.523733 | 0.579217 | 0.711644 | 0.540189 |
| | Average Canberra Distance | 0.128159 | 0.206656 | 0.093391 | 0.117366 | 0.126052 |
| | Average Standard Deviation | 0.189240 | 0.429693 | 0.289669 | 0.186215 | 0.235742 |
| Ada | Average Pearson Correlation | 0.998732 | 0.998655 | 0.997906 | 0.992348 | 0.998004 |
| | Average Spearman Rank Correlation | 0.956392 | 0.952797 | 0.823995 | 0.955461 | 0.952162 |
| | Average Jaccard's Index | 0.919947 | 0.866222 | 0.607214 | 0.830739 | 0.863409 |
| | Average Canberra Distance | 0.106299 | 0.108989 | 0.155215 | 0.122942 | 0.125797 |
| | Average Standard Deviation | 0.028835 | 0.031522 | 0.029137 | 0.083938 | 0.042076 |
| Splice | Average Pearson Correlation | 0.992299 | 0.993156 | 0.990926 | 0.974606 | 0.989386 |
| | Average Spearman Rank Correlation | 0.841882 | 0.842385 | 0.738889 | 0.843115 | 0.836391 |
| | Average Jaccard's Index | 0.761442 | 0.762814 | 0.597770 | 0.760529 | 0.742453 |
| | Average Canberra Distance | 0.187747 | 0.187556 | 0.224846 | 0.187334 | 0.190992 |
| | Average Standard Deviation | 0.070557 | 0.065447 | 0.081051 | 0.157357 | 0.096031 |
| HIVA | Average Pearson Correlation | 0.738293 | 0.764771 | 0.866538 | 0.746545 | 0.746723 |
| | Average Spearman Rank Correlation | 0.603392 | 0.621829 | 0.804467 | 0.648684 | 0.639793 |
| | Average Jaccard's Index | 0.654280 | 0.583914 | 0.752230 | 0.618569 | 0.623986 |
| | Average Canberra Distance | 0.277170 | 0.2369058 | 0.147782 | 0.260381 | 0.252760 |
| | Average Standard Deviation | 0.457420 | 0.395126 | 0.318563 | 0.466903 | 0.456952 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).