Submit to this Journal Review for this Journal Propose a Special Issue

Article Menu

Share Help Cite Discuss in SciProfiles

Open AccessArticle

Peer-Review Record

Distance Correlation-Based Feature Selection in Random Forest

Entropy 2023, 25(9), 1250; https://doi.org/10.3390/e25091250

by Suthakaran Ratnasingam^*

and Jose Muñoz-Lopez

Reviewer 1: Anonymous

Reviewer 2:

Boxiang Wang

Entropy 2023, 25(9), 1250; https://doi.org/10.3390/e25091250

Submission received: 30 May 2023 / Revised: 20 July 2023 / Accepted: 21 August 2023 / Published: 23 August 2023

(This article belongs to the Special Issue Recent Advances in Statistical Inference for High Dimensional Data)

Round 1

Reviewer 1 Report

The authors propose a new approach for building predictive models by combining distance correlation screening/thresholding and random forest. Some theoretical justification is provided. Numerical studies demonstrate the superior performance of the proposed method over traditional methods, such as no screening and screening via Pearson correlation. In my opinion, the proposed methodology is relatively straightforward, which reduces the novelty of the paper. Below are my detailed comments.

In practice, it is common to see analysts performing a variable screening before building machine learning models. According to the simulation, CC performs better under the linear case while DC under the nonlinear case. In practice, it is hard to know for sure whether the data exhibits linearity or nonlinearity. Therefore, why not do screening using both Pearson and distance correlations? Overall, I find the results in the numerical studies not entirely convincing.

How do analysts select the threshold parameter R*? Apparently, the performance of DC highly depends on the choice of R*. The authors should consider providing some guidelines for selecting such a tuning parameter.

Author Response

Please see the attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper aims to address the challenges posed by high-dimensional data and feature selection in traditional Random Forest (RF) algorithms. The authors propose a method incorporating distance correlation (dCor) into Random Forest regression for feature selection. dCor, a metric able to capture all types of dependencies between random vectors of arbitrary dimensions, is utilized as a preprocessing step in the RF algorithm. This assists in selecting features significantly correlated to the response variable. Through simulations and real-world applications, the authors demonstrate that their approach is competitive and often superior in handling high-dimensional nonlinear datasets.

The proposition of applying distance correlation prior to implementing random forest is not entirely novel. Other studies (e.g., https://arxiv.org/pdf/2006.12919.pdf and https://egrove.olemiss.edu/etd/1800/?utm_source=egrove.olemiss.edu%2Fetd%2F1800&utm_medium=PDF&utm_campaign=PDFCoverPages) have explored this approach. Therefore, the authors should clearly articulate what differentiates their proposal from the existing literature.
The threshold R* selection is a sensitive aspect of the model. A poor selection may negatively impact the performance of the random forest. It would be beneficial if the authors could discuss whether this crucial value can be determined independently of the out-of-bag error.
Notably, in most experiments presented in this paper, the RF models appear to underperform compared to the RLT models. This suggests that exploring more complex models could better demonstrate the necessity and advantages of using random forests. The authors might wish to delve into this aspect further.

Author Response

Please see the attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

My concerns have been addressed.

Article Menu

Distance Correlation-Based Feature Selection in Random Forest

Further Information

Guidelines

MDPI Initiatives

Follow MDPI