Distance Correlation-Based Feature Selection in Random Forest
Abstract
1. Introduction
2. Main Results
2.1. Feature Selection Method in Random Forest
Algorithm 1. Proposed DC-based Method
Given a training data set and the distance correlation set of length s,
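Because the steps of Algorithm 1 are not reproduced here, the following is only a minimal sketch of the general idea, assuming that each feature is scored by its sample distance correlation with the response (the V-statistic form of Székely, Rizzo and Bakirov, 2007) and that only features at or above a threshold are passed on to the forest. The function names `distance_correlation` and `dc_screen`, the `>=` rule, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered pairwise Euclidean distance matrix of a sample."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    # Subtract row and column means, then add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form), a value in [0, 1]."""
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = (a * b).mean()                    # squared sample distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 <= 0.0:
        return 0.0                            # a constant sample carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def dc_screen(X, y, threshold):
    """Return (indices of columns with score >= threshold, all scores).

    The >= rule is an assumption made for this sketch.
    """
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.flatnonzero(scores >= threshold), scores


# Toy data (hypothetical): the response depends on column 0 only, and nonlinearly.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
noise_feature = rng.normal(size=n)
X = np.column_stack([x0, noise_feature])
y = x0 ** 2

selected, scores = dc_screen(X, y, threshold=0.25)
print("distance correlations:", np.round(scores, 3))
print("selected columns:", selected)
```

The retained columns `X[:, selected]` would then be used to fit a random forest in place of the full predictor matrix. The quadratic toy response is chosen because its Pearson correlation with `x0` is close to zero, whereas distance correlation can still pick up the dependence.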
2.2. Theoretical Results
3. Simulation Study
- Under settings 1 & 2, we consider the following model
  - Setting 1: Generate from a normal distribution: , where , with and 0.8.
  - Setting 2: Generate from a normal distribution: , where , with
- Under setting 3, we consider the following model
  - Setting 3: Generate from a normal distribution: , where , with .
- Under setting 4, we consider the following model
  - Setting 4: Generate from .
3.1. Analysis of the Linear Models
3.2. Analysis of the Nonlinear Model
4. Applications
- Riboflavin Data: This dataset contains riboflavin production by Bacillus subtilis. There are observations of predictors (gene expressions) and a one-dimensional response variable.
- Boston Housing Data: This dataset contains housing data for 506 census tracts of Boston from the 1970 census. There are observations of predictors.
4.1. Riboflavin Data
4.2. Boston Housing Data
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Method | | | | |
---|---|---|---|---|---
Traditional RF | | 30.4468 | 32.3146 | 37.0157 | 39.9092
No | RLTNo1 | 17.1149 | 18.2449 | 20.6827 | 22.2395
No | RLTNo2 | 8.3586 | 9.2965 | 10.8636 | 12.1497
No | RLTNo5 | 5.9539 | 6.8420 | 8.4067 | 9.5437
Moderate | RLTMod1 | 23.5688 | 24.9247 | 29.2962 | 31.4494
Moderate | RLTMod2 | 12.7399 | 13.8862 | 16.9476 | 19.1914
Moderate | RLTMod5 | 9.7806 | 10.9047 | 13.5140 | 15.6142
CC | 0 | 30.4568 | 32.3099 | 36.9560 | 39.9454
CC | 0.1 | 22.8696 | 24.6372 | 29.9454 | 33.0442
CC | 0.2 | 16.5787 | 16.7887 | 16.9566 | 18.2652
CC | 0.3 | 15.9218 | 15.8904 | 15.7830 | 15.7455
CC | 0.4 | 13.3106 | 13.4890 | 13.0326 | 13.0766
CC | 0.5 | 12.5500 | 12.8932 | 12.4917 | 12.5678
CC | 0.6 | 16.4444 | 16.9558 | 16.4051 | 15.5541
DC | 0 | 30.5103 | 32.2662 | 36.9264 | 39.9739
DC | 0.1 | 30.4394 | 32.3129 | 36.9792 | 39.9157
DC | 0.2 | 30.4860 | 32.2304 | 37.0245 | 39.8639
DC | 0.3 | 30.2126 | 32.1138 | 37.0334 | 39.8655
DC | 0.4 | 20.8794 | 22.2660 | 27.2499 | 30.5149
DC | 0.5 | 16.7517 | 16.6341 | 16.3208 | 16.6678
DC | 0.6 | 13.7511 | 13.8123 | 13.5889 | 13.3938

Method | | | | |
---|---|---|---|---|---
Traditional RF | | 16.4542 | 16.8286 | 20.2293 | 21.4920
No | RLTNo1 | 11.1426 | 11.6650 | 13.5729 | 14.1749
No | RLTNo2 | 6.8722 | 7.3101 | 8.8551 | 9.6527
No | RLTNo5 | 5.4821 | 5.8649 | 7.3025 | 8.0649
Moderate | RLTMod1 | 14.9992 | 15.5370 | 18.7807 | 19.8693
Moderate | RLTMod2 | 10.3251 | 10.8486 | 13.8485 | 15.1718
Moderate | RLTMod5 | 8.4156 | 8.9015 | 11.3316 | 12.5533
CC | 0 | 16.4618 | 16.8028 | 20.2333 | 21.5206
CC | 0.1 | 13.0510 | 13.2847 | 16.1036 | 17.3913
CC | 0.2 | 10.7760 | 10.5976 | 10.9608 | 11.1928
CC | 0.3 | 10.2295 | 10.0385 | 10.0109 | 10.0872
CC | 0.4 | 9.2580 | 9.0398 | 9.0732 | 9.1315
CC | 0.5 | 8.5590 | 8.4243 | 8.5828 | 8.5259
CC | 0.6 | 9.1113 | 9.0128 | 9.1327 | 9.0838
DC | 0 | 16.4589 | 16.8685 | 20.2596 | 21.5370
DC | 0.1 | 16.4747 | 16.8312 | 20.2180 | 21.5444
DC | 0.2 | 16.4707 | 16.7899 | 20.1973 | 21.5172
DC | 0.3 | 16.3218 | 16.7368 | 20.2653 | 21.5056
DC | 0.4 | 12.2518 | 12.5301 | 14.5710 | 15.9063
DC | 0.5 | 10.3558 | 10.2450 | 10.2731 | 10.3228
DC | 0.6 | 9.4236 | 9.2640 | 9.3533 | 9.3839

Method | | | | |
---|---|---|---|---|---
Traditional RF | | 21.9640 | 23.6652 | 28.2053 | 30.0032
No | RLTNo1 | 13.0988 | 14.2620 | 16.5793 | 17.3747
No | RLTNo2 | 7.3378 | 8.2417 | 10.2177 | 11.1712
No | RLTNo5 | 5.5720 | 6.3689 | 8.2305 | 9.2038
Moderate | RLTMod1 | 17.9596 | 19.3122 | 23.0986 | 24.3520
Moderate | RLTMod2 | 11.4233 | 12.5715 | 16.2147 | 17.8654
Moderate | RLTMod5 | 9.1465 | 10.2833 | 13.5496 | 15.2372
CC | 0 | 21.9342 | 23.6987 | 28.1940 | 29.9885
CC | 0.1 | 21.7451 | 23.6321 | 28.2193 | 29.9617
CC | 0.2 | 20.9032 | 22.9340 | 27.5293 | 29.3341
CC | 0.3 | 16.8882 | 18.6162 | 23.0721 | 25.1728
CC | 0.4 | 11.9670 | 12.4959 | 13.4938 | 14.0448
CC | 0.5 | 11.3873 | 11.7433 | 11.6022 | 11.3566
CC | 0.6 | 9.0305 | 9.2198 | 9.3215 | 9.1254
DC | 0 | 21.9021 | 23.7338 | 28.1547 | 29.9792
DC | 0.1 | 21.8892 | 23.7192 | 28.1623 | 30.0492
DC | 0.2 | 21.8888 | 23.6486 | 28.1887 | 30.0208
DC | 0.3 | 21.8853 | 23.7239 | 28.2238 | 29.9949
DC | 0.4 | 21.6011 | 23.4470 | 28.0920 | 29.8334
DC | 0.5 | 19.3799 | 21.3558 | 26.1481 | 28.1744
DC | 0.6 | 12.5753 | 13.4929 | 15.1863 | 16.6041

Method | | | | |
---|---|---|---|---|---
Traditional RF | | 9.4389 | 9.5245 | 10.4246 | 10.7869
No | RLTNo1 | 8.6755 | 8.7385 | 9.4071 | 9.7955
No | RLTNo2 | 8.5479 | 8.6631 | 9.4587 | 9.9032
No | RLTNo5 | 8.6720 | 8.7762 | 9.5994 | 10.0118
Moderate | RLTMod1 | 9.6584 | 9.7615 | 10.7009 | 11.2133
Moderate | RLTMod2 | 9.7378 | 9.8579 | 10.9569 | 11.4871
Moderate | RLTMod5 | 9.8222 | 9.9758 | 11.0402 | 11.6132
CC | 0 | 10.5241 | 10.4354 | 11.7246 | 12.1731
CC | 0.1 | 11.0046 | 10.9849 | 12.0554 | 12.3790
CC | 0.2 | 11.3745 | 11.1895 | 11.8162 | 11.9509
CC | 0.3 | 10.8041 | 10.5800 | 10.9673 | 10.8763
DC | 0 | 9.4371 | 9.5271 | 10.4387 | 10.7732
DC | 0.1 | 9.4270 | 9.5461 | 10.4322 | 10.7692
DC | 0.2 | 9.4465 | 9.5276 | 10.4433 | 10.7636
DC | 0.3 | 9.4336 | 9.5344 | 10.4295 | 10.7577
DC | 0.4 | 8.9385 | 8.9611 | 9.6091 | 9.8364
DC | 0.5 | 9.4990 | 9.4992 | 9.5010 | 9.4111
DC | 0.6 | 10.4607 | 10.4244 | 10.3874 | 10.3362

Method | | | | |
---|---|---|---|---|---
Traditional RF | | 6.1719 | 6.3132 | 7.0491 | 7.4381
RLTNo1 | | 2.4868 | 2.4958 | 2.9554 | 3.3648
RLTNo2 | | 2.5882 | 2.6486 | 3.3094 | 3.8033
RLTNo5 | | 2.8512 | 2.8675 | 3.5907 | 4.3271
RLTMod1 | | 3.1720 | 3.1258 | 3.8918 | 4.5346
RLTMod2 | | 3.6176 | 3.5186 | 4.5701 | 5.1221
RLTMod5 | | 3.7851 | 3.7519 | 4.8743 | 5.7918
CC | 0 | 6.1638 | 6.2397 | 7.0040 | 7.4891
CC | 0.1 | 8.6832 | 8.9644 | 9.0353 | 9.1730
CC | 0.2 | 10.7540 | 10.7789 | 10.7112 | 10.5731
CC | 0.3 | 12.2340 | 12.2444 | 11.8764 | 12.3109
DC | 0 | 6.1879 | 6.1030 | 7.0218 | 7.5451
DC | 0.1 | 6.1925 | 6.0984 | 6.9839 | 7.4811
DC | 0.2 | 6.2513 | 6.0910 | 6.9863 | 7.4811
DC | 0.3 | 6.1112 | 6.0962 | 7.0018 | 7.4744
DC | 0.4 | 5.5324 | 5.5445 | 6.2003 | 6.7826
DC | 0.5 | 2.6557 | 2.5385 | 2.8895 | 3.2704
DC | 0.6 | 9.8633 | 9.5040 | 9.4988 | 9.9643

Method | |
---|---|---
Traditional RF | | 0.5029
No | RLTNo1 | 0.5521
No | RLTNo2 | 0.5459
No | RLTNo5 | 0.5436
Moderate | RLTMod1 | 0.5555
Moderate | RLTMod2 | 0.5216
Moderate | RLTMod5 | 0.5623

Threshold | CC | DC
---|---|---
0.00 | 0.5026 | 0.5071
0.05 | 0.4936 | 0.5133
0.10 | 0.4866 | 0.5049
0.15 | 0.4654 | 0.5104
0.20 | 0.4521 | 0.5130
0.25 | 0.4356 | 0.5043
0.30 | 0.4217 | 0.5063
0.35 | 0.4083 | 0.5076
0.40 | 0.3864 | 0.5100
0.45 | 0.4076 | 0.5029
0.50 | 0.5594 | 0.4990
0.55 | 0.4175 | 0.4873
0.60 | 0.5565 | 0.4628
0.65 | NA | 0.4358
0.70 | NA | 0.4126

Method | |
---|---|---
Traditional RF | | 11.6123
No | RLTNo1 | 16.5492
No | RLTNo2 | 16.7430
No | RLTNo5 | 16.0898
Moderate | RLTMod1 | 16.0028
Moderate | RLTMod2 | 15.6108
Moderate | RLTMod5 | 15.6015

Threshold | CC | DC
---|---|---
0.1 | 11.5548 | 11.5702
0.15 | 11.5674 | 11.5258
0.2 | 11.5926 | 11.5477
0.25 | 11.9115 | 11.5586
0.3 | 12.6297 | 11.5891
0.35 | 12.7505 | 11.5651
0.4 | 12.9315 | 11.5344
0.45 | 15.3672 | 11.5441
0.5 | 18.6801 | 11.5417
0.55 | 21.5029 | 11.5951
0.6 | 21.7865 | 11.9905
0.65 | 22.6410 | 12.5806
0.7 | 30.9052 | 13.0999

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ratnasingam, S.; Muñoz-Lopez, J. Distance Correlation-Based Feature Selection in Random Forest. Entropy 2023, 25, 1250. https://doi.org/10.3390/e25091250