2.2.1. Principles of Random Forest Algorithm [21,22,23]
Definition 1. A Random Forest is an ensemble classifier consisting of a group of decision tree classifiers {h(X, θk), k = 1, 2, …, K}, where each tree is grown from a random vector θk and the {θk} are independent and identically distributed. K denotes the number of decision trees in the random forest. For a given independent variable X, each decision tree classifier casts a vote, and the votes jointly determine the optimal classification result.
Essentially, a random forest is a classifier that aggregates multiple decision trees; if a decision tree is considered an expert in the classification task, then a random forest is a collection of such experts collaboratively classifying a given task.
The steps to generate a random forest are as follows:
Step 1. From the original training dataset, apply the bootstrap method to randomly draw K new sample sets with replacement and construct K classification and regression trees. The samples not drawn in each round form the corresponding K out-of-bag (OOB) datasets.
Step 2. If there are n features, randomly select mtry features (mtry ≤ n) at each node of each tree. Compute the information contained in each feature and select the one with the highest classification capability for node splitting.
Step 3. Grow each tree to the maximum extent without pruning.
Step 4. Combine the generated trees to form a random forest. Classify new data using this forest, with the classification result determined by the majority vote of the tree classifiers.
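As an illustration of Steps 1–4, the following minimal sketch builds a small forest by hand with bootstrap sampling, per-node random feature selection, and majority voting. It is only a sketch, not the implementation used in this paper: it assumes scikit-learn's DecisionTreeClassifier (whose max_features argument supplies the mtry features considered at each node), integer class labels, and illustrative default values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, K=100, mtry=None, random_state=0):
    """Steps 1-3: draw K bootstrap samples and grow one unpruned tree on each."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    forest = []
    for k in range(K):
        idx = rng.integers(0, n, n)          # bootstrap sample drawn with replacement
        tree = DecisionTreeClassifier(
            max_features=mtry,               # mtry features considered at each node
            random_state=k,                  # trees are grown to full depth (no pruning)
        ).fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    """Step 4: classify new data by majority vote over the K trees."""
    votes = np.stack([tree.predict(X) for tree in forest])   # shape (K, n_samples)
    # For each sample, return the most frequently predicted class (assumes integer labels)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

A call such as predict_forest(build_forest(X_train, y_train), X_test) then mirrors the voting scheme of Definition 1.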
Definition 2. Given a set of classifiers h1(X), h2(X), …, hK(X), each trained on a set drawn at random from the distribution of the random vector (X, Y), the margin function is defined as:

$$mg(X,Y)=\mathrm{av}_k\, I\big(h_k(X)=Y\big)-\max_{j\neq Y}\ \mathrm{av}_k\, I\big(h_k(X)=j\big)$$

where $I(\cdot)$ is the indicator function and $\mathrm{av}_k(\cdot)$ denotes the average over the K classifiers. The margin function measures the extent to which the average number of votes for the correct class exceeds the average number of votes for any other class. The larger the margin value, the more reliable the classification prediction.
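For intuition, the margin of Definition 2 can be computed empirically from the trees' votes. The sketch below is illustrative only; it assumes a (K × n) array of per-tree predictions and arbitrary (hashable) class labels.

```python
import numpy as np

def margin(votes, y, classes):
    """Empirical margin mg(X, Y): average vote share for the true class
    minus the largest average vote share for any other class.

    votes:   (K, n) array, votes[k, i] = class predicted by tree k for sample i
    y:       (n,)  array of true labels
    classes: array of all possible class labels
    """
    classes = np.asarray(classes)
    n = votes.shape[1]
    # Fraction of trees voting for each class, per sample: shape (n, C)
    share = np.stack([(votes == c).mean(axis=0) for c in classes], axis=1)
    true_idx = np.array([np.flatnonzero(classes == label)[0] for label in y])
    true_share = share[np.arange(n), true_idx]
    share_wrong = share.copy()
    share_wrong[np.arange(n), true_idx] = -np.inf   # exclude the true class
    return true_share - share_wrong.max(axis=1)
```

Positive margins correspond to samples the ensemble classifies correctly; the larger the margin, the more reliable the prediction.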
Definition 3. The generalization error is defined as:

$$PE^{*}=P_{X,Y}\big(mg(X,Y)<0\big)$$

where the probability $P_{X,Y}$ is taken over the joint distribution of X and Y. In a random forest, when there are enough decision tree classifiers, $PE^{*}$ obeys the strong law of large numbers.
Theorem 1. As the number of decision trees in the random forest increases, for almost all sequences θ1, θ2, …, the generalization error $PE^{*}$ converges almost surely to:

$$P_{X,Y}\Big(P_{\theta}\big(h(X,\theta)=Y\big)-\max_{j\neq Y}P_{\theta}\big(h(X,\theta)=j\big)<0\Big)$$

Theorem 1 indicates that the random forest does not suffer from overfitting as the number of decision trees increases; instead, its generalization error converges to a bounded limiting value.
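The convergence stated in Theorem 1 can be observed empirically by tracking the out-of-bag error as trees are added. The snippet below is illustrative only: the synthetic dataset and tree counts are assumptions, and scikit-learn's RandomForestClassifier stands in for the forest described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here purely for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for k in (25, 50, 100, 300, 500):
    rf = RandomForestClassifier(n_estimators=k, oob_score=True,
                                random_state=0).fit(X, y)
    # The OOB error estimates the generalization error PE*; it levels off
    # rather than deteriorating as the number of trees grows.
    print(k, round(1.0 - rf.oob_score_, 4))
```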
2.2.2. Optimized Strategies
(1) Feature Importance Evaluation and Feature Selection
Evaluating the importance of various features helps identify the most influential features for the prediction target. This paper adopts the variable importance measure based on out-of-bag (OOB) data classification accuracy.
Definition 4. The variable importance measure based on OOB data classification accuracy is defined as the average reduction in classification accuracy before and after a slight perturbation of the OOB data’s independent variable values.
Suppose there are bootstrap samples b = 1, 2, …, B, where B is the number of bootstrap samples (i.e., the number of decision trees). The variable importance measure of feature Xj is calculated as follows:
Step 1. Set b = 1, create a decision tree Tb on the b-th bootstrap sample, and denote the corresponding OOB data as $L_b^{\mathrm{OOB}}$.
Step 2. Use Tb to classify the data in $L_b^{\mathrm{OOB}}$, and count the number of correctly classified instances, denoted as $R_b^{\mathrm{OOB}}$.
Step 3. For feature Xj, j = 1, 2, …, N, perturb the values of Xj in $L_b^{\mathrm{OOB}}$, and denote the perturbed dataset as $L_{b,j}^{\mathrm{OOB}}$. Use Tb to classify the data in $L_{b,j}^{\mathrm{OOB}}$ and count the number of correctly classified instances, denoted as $R_{b,j}^{\mathrm{OOB}}$.
Step 4. For b = 2, 3, …, B, repeat steps 1–3.
Step 5. Feature importance calculation: The variable importance measure $D_j$ of feature Xj is calculated using the following formula:

$$D_j=\frac{1}{B}\sum_{b=1}^{B}\big(R_b^{\mathrm{OOB}}-R_{b,j}^{\mathrm{OOB}}\big)$$
Step 6. Feature Ranking: Rank all features in descending order based on their importance measures.
Step 7. Feature Selection: Select the top-ranked features for model training and prediction.
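A minimal sketch of Steps 1–7 is given below. It assumes scikit-learn's DecisionTreeClassifier for the individual trees and implements the perturbation of Step 3 as a random permutation of the feature's OOB values; the function name, defaults, and the use of p for the number of features N are illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, B=100, mtry=None, random_state=0):
    """OOB-accuracy-based variable importance D_j (Steps 1-5)."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape              # p corresponds to the N features in Step 3
    D = np.zeros(p)
    for b in range(B):
        # Step 1: bootstrap sample and its out-of-bag complement
        idx = rng.integers(0, n, n)
        oob = np.setdiff1d(np.arange(n), idx)
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(max_features=mtry,
                                      random_state=b).fit(X[idx], y[idx])
        # Step 2: correctly classified OOB instances before perturbation
        r_before = (tree.predict(X[oob]) == y[oob]).sum()
        for j in range(p):
            # Step 3: perturb feature j by permuting its OOB values
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            r_after = (tree.predict(X_perm) == y[oob]).sum()
            D[j] += r_before - r_after
    # Step 5: average the accuracy drop over the B trees
    return D / B

# Steps 6-7: rank features by importance and keep the top-ranked ones
# D = oob_permutation_importance(X_train, y_train)
# ranking = np.argsort(D)[::-1]
```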
(2) Random Forest Parameter Optimization
The performance of the random forest model is mainly influenced by the number of trees n, the maximum depth of each decision tree d, the minimum number of samples required to split an internal node s, and the maximum number of leaf nodes m. Traditional methods use grid search for parameter optimization, which involves calculating the objective function value for each parameter combination by traversing each grid point. This approach often becomes trapped in local optima and has high computational complexity. To improve the efficiency and accuracy of parameter optimization, this paper employs a genetic algorithm for adaptive parameter optimization.
Like other classical intelligent algorithms, such as the Harmony search algorithm [24,25], the cuckoo search algorithm [26,27], and the Sparrow Search Algorithm [28], the Genetic Algorithm (GA) is regarded as one of the most classical heuristic algorithms and delivers excellent performance. The GA combines mathematical analysis and computer simulation, transforming the optimization process into mechanisms analogous to Darwinian natural selection and genetic operations such as chromosome crossover and mutation. Compared with conventional optimization algorithms, this approach can often reach better results faster, making it well suited to complex combinatorial optimization problems [29]. The steps for optimizing the random forest algorithm based on the genetic algorithm are as follows.
Step 1. Initialization: Randomly generate a set of parameter combinations containing n, d, s, and m to form the initial population.
Step 2. Evaluation and Selection: Apply the random forest algorithm to each parameter combination in the initial population. Use classification accuracy as the evaluation criterion, calculate the classification accuracy of each parameter combination, and select the best parameter combinations. Classification accuracy is defined as:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
where TP (true positive) represents correct positive predictions; TN (true negative) represents correct negative predictions; FP (false positive) represents incorrect positive predictions; FN (false negative) represents incorrect negative predictions.
Step 3. Genetic Operations and Iterations: Perform crossover operations on the selected excellent parameter combinations to generate new potential excellent parameter combinations. Optionally introduce mutation operations to increase population diversity and prevent local optima. Form a new generation population with the newly generated parameter combinations and the excellent individuals from the parent generation. Repeat evaluation, selection, and genetic operations until the preset maximum number of iterations or until the fitness improvement threshold is met.
Step 4. Output Results: Select the parameter combination with the highest fitness (highest accuracy) from the final population. Output this optimal parameter set as the best configuration for the random forest algorithm.
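The following sketch illustrates Steps 1–4 with a simple genetic loop over the four parameters (n, d, s, m). It is not the authors' implementation: the parameter ranges, population size, single-point crossover, and the use of cross-validated accuracy as the fitness function are assumptions, and scikit-learn's RandomForestClassifier is used for evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative search ranges for (n, d, s, m); not values taken from the paper.
BOUNDS = [(10, 300), (2, 30), (2, 20), (10, 200)]

def fitness(params, X, y):
    """Fitness = classification accuracy of the forest built with these parameters."""
    n, d, s, m = (int(v) for v in params)
    rf = RandomForestClassifier(n_estimators=n, max_depth=d,
                                min_samples_split=s, max_leaf_nodes=m,
                                random_state=0)
    return cross_val_score(rf, X, y, cv=3, scoring="accuracy").mean()

def ga_optimize(X, y, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial population of parameter combinations
    pop = np.array([[rng.integers(lo, hi + 1) for lo, hi in BOUNDS]
                    for _ in range(pop_size)])
    for _ in range(generations):
        # Step 2: evaluate each combination and keep the better half as parents
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            # Step 3: single-point crossover between two randomly chosen parents
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, len(BOUNDS))
            child = np.concatenate([a[:cut], b[cut:]])
            if rng.random() < mutation_rate:          # optional mutation for diversity
                g = rng.integers(len(BOUNDS))
                child[g] = rng.integers(BOUNDS[g][0], BOUNDS[g][1] + 1)
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    # Step 4: return the best parameter combination in the final population
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()], scores.max()
```

In this sketch the stopping rule is a fixed number of generations; a fitness-improvement threshold, as mentioned in Step 3, could be added by tracking the best score across generations.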