In this study, an RF model [31] is adopted for soil classification. The process of applying the RF methodology to soil classification is depicted in
Figure 3. The approach comprises several critical stages, beginning with data collection and preparation. Feature selection is then carried out to pinpoint the attributes most pertinent to accurate classification. Next, the model is built using the chosen features, and its hyperparameters are tuned to enhance performance. The model’s effectiveness is then assessed, followed by thorough testing to ensure robustness and reliability. In this study, the training data comprise 70% of the database (364,921 data points) and the testing data the remaining 30% (156,395 data points). Finally, the results are interpreted and scrutinized to derive meaningful insights. The following sections elaborate on each of these steps in detail.
4.1. Data Collection
A comprehensive dataset has been collected for analysis, comprising the coefficient of curvature (Cc), the coefficient of uniformity (Cu), the plasticity index (PI), the classification of soil as organic or inorganic, the liquid limit (LL), the percentage passing the No. 4 sieve, and the percentage passing the No. 200 sieve, along with the corresponding soil classifications. These parameters are delineated within specified ranges to form a synthetic database for training the RF model. The compiled parameters are detailed in
Table 1. The details of the database established in this study are as follows:
Initially, this study lists the reasonable ranges for seven factors and hypothesizes the possible values of each factor within their respective ranges using different intervals, as listed in
Table 1. For example, the Cc has a reasonable range of 0 to 10, and its values are assumed at intervals of 1. Therefore, the values of the Cc in the database are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. Other factors are hypothesized in the same way.
Subsequently, the different values of each factor are combined, and unreasonable combinations are eliminated. Examples of unreasonable combinations are as follows: (1) The percentage passing the No. 4 sieve must be at least as great as the percentage passing the No. 200 sieve; therefore, combinations where the percentage passing the No. 4 sieve is less than the percentage passing the No. 200 sieve are eliminated. (2) The LL of a soil must be greater than its PI; therefore, combinations where the LL is less than the PI are eliminated. (3) Coarse-grained soils are typically not organic; therefore, combinations where the soil is both coarse-grained and organic are eliminated.
Finally, by integrating the classification criteria for various soils, the results of soil classification are established in
the database of Table 1. Overall, the database comprises a total of 521,316 data records. This database is then used to train and test the RF model.
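The combine-and-filter procedure described above can be sketched as follows. The grids here are hypothetical and much coarser than those of Table 1 (which yield 521,316 records); only the elimination rules are taken from the text.

```python
from itertools import product

# Hypothetical coarse grids for illustration only; the paper's actual
# intervals (Table 1) are finer and produce 521,316 records in total.
cc_values = range(0, 11)         # coefficient of curvature Cc: 0..10, step 1
pass_no4 = range(0, 101, 10)     # % passing No. 4 sieve
pass_no200 = range(0, 101, 10)   # % passing No. 200 sieve
ll_values = range(0, 101, 10)    # liquid limit LL
pi_values = range(0, 61, 10)     # plasticity index PI

records = []
for cc, p4, p200, ll, pi in product(cc_values, pass_no4, pass_no200,
                                    ll_values, pi_values):
    # Eliminate unreasonable combinations (rules (1) and (2) from the text):
    if p4 < p200:   # passing No. 4 cannot be less than passing No. 200
        continue
    if ll < pi:     # LL cannot be less than PI
        continue
    records.append((cc, p4, p200, ll, pi))

print(len(records))   # number of surviving combinations
```

Rule (3), eliminating organic coarse-grained soils, would be applied in the same way once the organic/inorganic flag and the coarse/fine boundary are added to the grid.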
4.2. Out-of-Bag (OOB) Predictor Importance
In this study, the Out-of-Bag (OOB) feature of the RF model is utilized to assess the importance of the seven factors in soil classification. Through the OOB method, we examine the RF model’s training process and use the OOB samples of each tree to estimate the decrease in prediction accuracy when the values of a given feature are randomly permuted.
The OOB predictor importance serves as a technique for assessing the relevance of features within the RF model. The model consists of multiple decision trees, each constructed using a distinct bootstrap sample and feature subset. Because each tree is built from a different random sample, some data points are never used in training a given tree; these are that tree’s OOB data. The OOB predictor importance evaluates each feature’s contribution by measuring how much the prediction accuracy on the OOB data deteriorates, averaged over all trees, when that feature’s values are randomly permuted. This offers an intuitive means to gauge each feature’s impact on the RF model’s predictive performance and aids in selecting the most crucial features for modeling. By integrating the OOB predictor importances, the influential predictors within the RF model are identified.
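As a hypothetical sketch (not the study’s implementation), OOB permutation importance can be computed with scikit-learn’s bagged trees, which expose each estimator’s bootstrap indices; the toy data stand in for the soil database:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data with seven features, standing in for the soil database.
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           random_state=0)

# A bagged ensemble of trees; BaggingClassifier exposes each estimator's
# bootstrap indices, enabling a true OOB permutation importance.
ens = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                        n_estimators=50, bootstrap=True,
                        random_state=0).fit(X, y)

rng = np.random.default_rng(0)
importances = np.zeros(X.shape[1])
for tree, sampled in zip(ens.estimators_, ens.estimators_samples_):
    oob = np.setdiff1d(np.arange(len(X)), sampled)  # rows this tree never saw
    base = tree.score(X[oob], y[oob])               # OOB accuracy, unpermuted
    for j in range(X.shape[1]):
        Xp = X[oob].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])        # break feature j only
        importances[j] += base - tree.score(Xp, y[oob])
importances /= ens.n_estimators                     # average drop over trees
print(importances.round(3))
```

A near-zero or negative value means permuting that feature does not hurt the OOB accuracy, i.e., the feature carries little usable information for the ensemble.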
The analysis findings regarding variable importance using the OOB predictor, as shown in
Figure 4, reveal intriguing insights. Factors 1 (Cc) and 2 (Cu) exhibit negative importance values, indicating that they contribute little to the classification process. This OOB result is consistent with soil mechanics practice: Cc and Cu are not employed in classifying every soil type. These coefficients primarily describe the grain size distribution of coarse-grained soils; they are not applicable to fine-grained soils (such as silt and clay), which lack a significant range of particle sizes, and they are likewise not used for coarse-grained soils with more than 12% fines or for organic soils. Conversely, Factors 5 (LL) and 7 (percentage passing the No. 200 sieve) demonstrate the highest importance values, highlighting their pivotal roles in soil classification. This observation implies that the LL and the percentage passing the No. 200 sieve significantly influence the outcome of the classification process, and both warrant careful consideration in soil analysis and interpretation.
4.3. Model Construction
The process of initializing the RF model begins with a predetermined number of decision trees. Subsequently, it undergoes training on the designated dataset, where each tree’s growth occurs through the utilization of a bootstrap sample of the data. At each node, the model selects the optimal split based on a subset of features. Illustrated in
Figure 5, the RF’s architecture involves the assembly of multiple decision trees, with each tree constructed independently utilizing a random subset of the training data and input features. For this study, the training and testing datasets constitute 70% and 30% of the total data, respectively. Throughout the training phase, each tree expands either until it reaches its maximum depth or meets a specified stopping criterion, such as the minimum number of samples required for node splitting or a maximum depth threshold.
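A minimal sketch of this construction, assuming scikit-learn and stand-in data rather than the actual soil database; the hyperparameter values are illustrative, not the study’s tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data with seven features, mirroring the paper's factor count.
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           n_classes=4, random_state=42)

# 70% training / 30% testing, as in this study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=42)

# Each tree grows on a bootstrap sample, choosing the best split at each
# node from a random feature subset, until a stopping criterion is met.
rf = RandomForestClassifier(n_estimators=200, max_depth=None,
                            min_samples_split=2, random_state=42)
rf.fit(X_tr, y_tr)
print(f"test accuracy: {rf.score(X_te, y_te):.3f}")
```

For classification, `predict` already returns the majority vote across the trees, so no manual aggregation step is needed.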
Once all the decision trees are built, predictions are produced by consolidating the outputs from each individual tree. In regression tasks, the final prediction is typically the average of all tree predictions, while in classification tasks, it is usually determined by a majority vote among the trees. The inclusion of randomness in the construction of each tree helps to reduce the correlation among the trees and mitigate the risk of overfitting. Assuming there exists a database D, it can be represented as follows:

D = \{(x_i, y_i)\}, \quad i = 1, 2, \ldots, N, \quad x_i \in \mathbb{R}^p

In this equation, x, y, N, and p are the input, the output, the number of data points, and the number of factors, respectively. If D is divided into M regions R_1, R_2, \ldots, R_M, and a constant c_m is used to represent the simulated output of each region R_m, the following equation can be obtained:

f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)

where I is an indicator function. By incorporating the least-squares sum as the criterion, the optimal constant, \hat{c}_m, can be obtained as the average of the output values, y_i, within the region:

\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)

Assuming the presence of a splitting variable j and a designated split point s, the database is partitioned into two distinct subsets, as indicated by the following equation:

R_1(j, s) = \{x \mid x_j \le s\}, \quad R_2(j, s) = \{x \mid x_j > s\}

As per the preceding equation, the search for the suitable splitting variable j and split point s results in the following equation:

\min_{j, s}\left[\min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2\right]

Referring to the equation above, the internal minimization for any combination of j and s can be deduced from the subsequent expressions:

\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)), \quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s))

Utilizing the aforementioned equations, the optimal pair (j, s) can be determined, facilitating the partitioning of the data into two regions. Iterating through the described computations enables the data to be split sequentially into all resulting regions. If a decision tree T partitions the data into regions R_1, R_2, \ldots, R_{|T|} via m nodes, where |T| represents the total number of regions (terminal nodes), f(x) can be articulated as follows:

f(x) = \sum_{m=1}^{|T|} \hat{c}_m I(x \in R_m)
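The exhaustive search for the best pair (j, s) described above can be sketched as follows, using the region means as the solution of the inner minimization; the tiny dataset is purely illustrative:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive least-squares search for the splitting pair (j, s)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:   # candidate split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Inner minimization: c1, c2 are the region means of y.
            sse = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Tiny illustrative dataset: feature 0 cleanly separates low/high outputs.
X = np.array([[1, 0], [2, 0], [3, 10], [4, 10]], dtype=float)
y = np.array([1.0, 1.0, 9.0, 9.0])
j, s, sse = best_split(X, y)
print(j, s, sse)   # → 0 2.0 0.0
```

Splitting on feature 0 at s = 2 leaves each region with a constant output, so the summed squared error is zero; a full tree builder would recurse on each resulting region.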
Bagging, or bootstrap aggregation, is a technique employed to acquire an aggregated predictor by creating numerous predictor variations and amalgamating them. When this aggregated predictor is employed for numerical prediction, it calculates the average of the results from each variation and may also conduct a majority vote on prediction outcomes. Different predictor variations are obtained by sampling from the dataset, with each sampling akin to modeling a novel dataset.
Assuming a database D as described earlier, B smaller datasets, D_1, D_2, \ldots, D_B, are obtained by sampling from D. The sampling process draws a fixed number of samples each time, and the sampled data are replaced back into the original dataset before the next draw (bootstrap sampling with replacement). After each small dataset D_b is processed by the base algorithm, the individual results, \hat{f}_b(x), are collected, and the final training result, \hat{f}_{\mathrm{bag}}(x), is obtained by averaging all results, expressed as follows:

\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)

where \hat{f}_b(x) is the output obtained for each small dataset by the base algorithm.
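A minimal sketch of bagging with a simple base algorithm (an assumed least-squares line fit standing in for the study’s decision trees):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy observations of y = 2x (hypothetical base problem).
x = np.linspace(0, 1, 50)
y = 2 * x + rng.normal(0, 0.2, size=50)

B = 25          # number of bootstrap datasets D_b
preds = []
for _ in range(B):
    idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
    # Base algorithm: least-squares line fit on the bootstrap sample.
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    preds.append(slope * x + intercept)

# Bagged predictor: average of the B base predictions at each x.
f_bag = np.mean(preds, axis=0)
print(f_bag.shape)
```

Averaging the B fits reduces the variance contributed by any single bootstrap sample, which is the same mechanism the RF model relies on when averaging (or voting over) its trees.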
In an RF model, each decision tree necessitates the configuration of specific hyperparameters, such as the number of trees, the maximum tree depth, and the number of features considered when partitioning a node. The selection of these hyperparameters significantly influences the model’s performance. In this study, the proposed model is trained on bootstrap samples from the dataset, and the predictions are aggregated to produce the final output. While the RF method naturally utilizes bagging, assessing the performance of the ensemble itself offers valuable insights into model stability and accuracy.