1. Introduction
In data mining and machine learning, classification analysis is a well-researched method. Because of its ability to forecast future outcomes, it is used in a wide range of real-world scenarios. However, classification accuracy depends directly on the quality of the training data used. Real-world data frequently have an imbalanced class distribution, in which the majority class dominates and the minority classes are neglected.
When dealing with an imbalanced class distribution problem, selecting appropriate training data becomes crucial for improving classification accuracy. When all the available data are used for training, the resulting classifier tends to predict most of the incoming data as belonging to the majority class. This leads to the misclassification of minority class instances. Hence, careful selection of training data is essential to address the challenges posed by imbalanced class distributions in classification problems. In the context of forest ecosystems, the need for accurate classification algorithms cannot be overstated. Forests are a critical component of the planet’s ecological balance, sequestering and storing massive amounts of carbon from the atmosphere. The carbon stored in forest biomass is a crucial element of healthy forest ecosystems and the global carbon cycle.
Forests store carbon in various forms that can be challenging to accurately quantify. The estimation of carbon storage in forests depends on several factors, including the density of tree wood, decay class, and density reduction factors. Accurate estimations of carbon storage in forests are essential for effective carbon flux monitoring. Moreover, the classification of forest data is critical in determining the health and productivity of forest ecosystems. Forest classification algorithms can help identify various features of forests, such as tree species, forest density, and biomass, which are essential in monitoring changes in forest structure and function.
Accurate forest-based classification can also help to predict the occurrence and spread of forest disturbances like wildfires, insect infestations, and diseases. Such disturbances can cause significant losses of carbon from forests, negatively impacting the planet’s ecological balance. Therefore, the development of accurate and robust classification algorithms for forest datasets is critical for maintaining healthy forest ecosystems and mitigating the impact of natural disasters on the environment. In the realm of predicting tree decay rates in forests, past research has mainly focused on using regression techniques. However, these methods may not be suitable for distinguishing individual dead trees within a forest.
In this study, a deep neural network (DNN) architecture is used to detect individual dead trees within the forest more accurately. To this end, this research work proposes a novel approach that deals with imbalanced datasets using sampling techniques. The imbalanced nature of forest datasets can make predictions less accurate, particularly when most data points belong to a single class (e.g., living trees). By employing sampling techniques, we balanced the dataset, which improved the accuracy of predictions for both dead and living trees on unbalanced forest datasets. The remainder of this paper is organized as follows. The dataset used for this research is described first. Then, we employ a DNN with sampling techniques to forecast both dead and living trees. This method is then compared with other techniques to assess its efficacy. Finally, we present our findings and future directions.
Overall, the development of DNN architecture for predicting individual dead trees in forests, coupled with sampling techniques to handle imbalanced datasets, can raise prediction accuracy and contribute to better forest management. It enables forest managers to conserve and protect the forest ecosystem by making informed decisions.
2. Literature Review
In general, the process of classifying unbalanced datasets consists of three steps: selecting features, fitting the data distribution, and training a model. The review of the literature is presented below in
Table 1.
The goal of feature selection is to identify subsets of features that are most suited for classifying the unbalanced data while considering the feature class imbalance. This contributes to the development of a more efficient classifier [10,11,12,13]. To limit the impact of class imbalances on the classifier, most data preparation procedures, such as various resampling techniques, are used to adjust the data distribution [14,15,16,17]. These techniques significantly balance the datasets.
Training a model to accommodate an unequal data distribution mainly requires either modifying an existing classification algorithm or applying ensemble learning. Standard cost-sensitive learning is an example of the former [18,19,20]; it improves minority class classification accuracy by increasing the weights of the minority class samples. Improved classification accuracy can also be achieved via ensemble learning techniques like boosting and bagging [21,22,23].
Distribution-level data resampling will resolve the class imbalance. The most significant advantage of this methodology is that the sampling method and the classifier training procedure are independent of one another. Typically, the sample distribution of the training set is changed at the data preprocessing stage to decrease or eliminate class imbalance. The representative methods consist of a few resampling strategies, with the two main categories being oversampling and undersampling.
Oversampling entails adding appropriately created new points to increase the number of sample points in a minority class to attain sample balance. The synthetic minority oversampling technique (SMOTE) and several of its variants, as well as random oversampling (ROS), are examples of prevalent algorithms [24]. SMOTE generates synthetic samples and inserts them between a given sample and its neighbors, whereas ROS balances datasets by duplicating minority sample points at random.
SMOTE’s generation rule can be expressed as Equation (1):

$$x_{new} = x_j + \mathrm{rand}(0,1) \times (\hat{x}_j - x_j) \qquad (1)$$

In Equation (1), $x_j$ ($j$ = 1, 2, 3, 4, 5) represents a minority class point, $x_{new}$ represents the generated virtual samples based on the nearest neighbors $\hat{x}_j$, and rand(0, 1) is a random number between 0 and 1 [4].
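For illustration, a minimal NumPy sketch of this generation rule is given below; the point coordinates, neighbor count, and variable names are illustrative only and are not taken from the forest dataset.

```python
import numpy as np

def smote_like_samples(x, neighbors, rng=None):
    """Generate one synthetic point per neighbor, following Equation (1):
    x_new = x_j + rand(0, 1) * (x_hat_j - x_j)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    synthetic = []
    for x_hat in neighbors:                         # nearest minority-class neighbors
        gap = rng.random()                          # rand(0, 1)
        synthetic.append(x + gap * (np.asarray(x_hat, dtype=float) - x))
    return np.vstack(synthetic)

# Illustrative minority point and its 5 nearest minority-class neighbors
x = [0.42, 1.10]
neighbors = [[0.40, 1.00], [0.45, 1.20], [0.39, 1.05], [0.50, 1.15], [0.41, 1.08]]
print(smote_like_samples(x, neighbors))
```

Each synthetic point lies on the line segment between the original point and one of its neighbors, which is the interpolation behavior described above.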
Earlier studies relied heavily on local information to increase sample sizes. Although the number of samples is equalized, the overall distribution of the data is not taken into consideration, so the distribution of the new dataset after oversampling cannot be guaranteed. Furthermore, an oversampling approach may introduce a large amount of redundant information, increasing the classifier’s computation and training time.
Undersampling decreases the sample size of the majority class by eliminating some of its samples, and therefore has the apparent benefit of shortening training time. The most basic undersampling approach is random undersampling (RUS) [24], which discards majority class samples at random. To balance the number of majority class samples with that of the minority class, another undersampling strategy selects appropriate majority class samples. This method makes the training set more evenly distributed and improves the classification accuracy of minority class samples. The disadvantage is that a sizable portion of the majority class characteristics could be lost, so the model might not fully learn the majority class properties. As a result, it is crucial to set up the learning process so that most of the majority class information is retained.
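As an illustration of RUS, the following sketch uses the imbalanced-learn library (the same library used later in Section 3.3) on a synthetic stand-in dataset; the data and parameters are placeholders rather than the forest data.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for an imbalanced two-class dataset (~80% / 20%)
X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=42)
print("before:", Counter(y))

# RUS: randomly discard majority-class samples until the classes are balanced
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("after:", Counter(y_res))
```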
3. Materials and Methods
This research work is aimed at predicting the decay information of forest trees. Healthy trees absorb harmful carbon dioxide and emit oxygen; trees are the carbon sink of our planet. At the same time, decayed and fallen trees emit carbon dioxide, so identifying the decay level of a tree is essential for preserving the ecological condition of our planet. In this research work, details about trees in a forest are examined. Several attributes are associated with forest trees. The age of a tree is usually determined by its wood density: during the initial years, the wood density increases, and after the tree attains normal growth, the wood density starts decreasing. Based on wood density, the trees are classified into five decay classes ranging from “freshly killed” to “extremely decayed”. Dead trees fall, may cause forest fires, and can take several years to decompose. Here, the dataset is first preprocessed to compute the wood density and to identify the decay class (either not yet decayed or decayed) using the wrapper method. Because the dataset is imbalanced after decay class identification, stratified sampling is used to overcome this issue without losing any inputs [25]. The stratified sampled input is fed to the DNN for drawing predictions about the decay class of a tree.
This section contains a description of the proposed methodology, a description of the forest tree dataset, and the preparation process. The architecture of the proposed stratified sampling-based deep neural networks approach for increasing the prediction accuracy of the unbalanced dataset is shown in
Figure 1.
The proposed methodology can be categorized into three phases: preprocessing (feature selection and skewness checking), stratified sampling to balance the data, and DNN-based classification.
A neural network is chosen for classification in this research work over SVM, Random Forest, and Naïve Bayes because of its ability to handle imbalanced data, its feature learning capabilities, its capacity to model nonlinear relationships, and the ability to fine-tune its hyperparameters for optimal performance.
3.1. Description of the Dataset
The dataset was obtained from the USDA repository [26]. Data collection began in 1985 and is expected to last until 2050. Douglas fir, red cedar, Pacific silver fir, and western hemlock are the four species investigated. The data gathered for this study compare the breakdown of small logs (20–30 cm in diameter and 2 m in length) in a stream channel at the H.J. Andrews Experimental Forest to that of logs on an adjacent upland site. The stream is a third-order stream above the confluence of Mack Creek and Lookout Creek. A portion of the logs are periodically resampled to assess changes in volume, bark cover, density, and nutrient reserves. Wood density is calculated from dry mass and volume, as determined by dimensional measures.
Table 2 shows the attributes in the dataset.
For training and testing, different proportions of the dataset were employed. The training dataset contains the decay class, wood density, and wood density threshold value of the relevant species. The test data include information on the four species, such as circumference, tree age, volume, dry weight, and moisture. A total of 54,000 instances with 21 attributes are available in the test dataset.
3.2. Preprocessing the Dataset
The dataset is preprocessed before the technique is applied. In this forest dataset, the data distribution is uneven between the live and decayed trees. A tree may belong to either the not-yet-rotted group or the decayed tree group. Out of the 11,387 trees in the dataset, 9132 belong to the not-yet-rotted group, whereas only 2255 trees belong to the decayed group. A model trained on such data can either overfit or underfit. This kind of uneven data distribution has a critical impact on prediction and categorization, so the data need to be preprocessed.
The preprocessing stage consists of feature selection and checking the skewness of the data. This process will help to reduce the time consumption in handling the unbalanced forest dataset.
3.2.1. Feature Selection Method
The dataset is preprocessed with the backward elimination feature selection approach to identify the optimal subset of attributes for forecasting the tree’s wood density (Kusy and Zajdel, 2021). Six of the twenty-one features essential for prediction were chosen via the wrapper method (backward elimination).
With the wrapper technique, the model is iteratively trained on several subsets of features, and the best subset of features is chosen based on inferences from the model. Backward elimination is a feature selection strategy that starts with a model incorporating all the available features and gradually eliminates the least significant ones until a stopping criterion is met. This wrapper strategy is typically combined with statistical models to choose a subset of important features: by repeatedly removing the features that are least significant at the selected significance level, backward elimination identifies the most pertinent characteristics.
Table 3 shows the extracted features used for further processing. Before assessing the feature subsets, these strategies train and test the model using a variety of feature combinations. The strategy reduces overfitting and eliminates irrelevant or redundant features to enhance the model’s performance and interpretability.
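A minimal sketch of the backward elimination wrapper is given below, assuming numeric explanatory variables and using the p-values of an ordinary least squares fit as the significance criterion; the column names in the commented usage are assumptions about the dataset schema, not the authors’ exact implementation.

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05) -> list:
    """Drop the least significant feature (highest p-value) until every
    remaining feature is significant at the chosen level."""
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > alpha:
            features.remove(worst)        # eliminate the least significant feature
        else:
            break
    return features

# Hypothetical usage on the forest dataframe (column names are assumptions,
# and categorical attributes such as Species would need numeric encoding first):
# selected = backward_elimination(
#     df[["Species", "Diameter", "Volume", "Wetwt", "DryWt", "Decay"]],
#     df["WoodDensity"])
```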
In the experimental dataset, the explanatory variables Species, Diameter, Volume, Wet Weight, Dry Weight, and Decay are considered for multiple linear regression, and the target variable is Wood density of the tree.
The prediction equation is given below:

$$y_i = \beta_0 + \beta_1\,\mathrm{Species}_i + \beta_2\,\mathrm{Diameter}_i + \beta_3\,\mathrm{Volume}_i + \beta_4\,\mathrm{Wetwt}_i + \beta_5\,\mathrm{DryWt}_i + \beta_6\,\mathrm{Decay}_i + \epsilon_i$$

where, for $n$ observations ($i = 1, \ldots, n$), $y_i$ is the dependent variable (wood density), and Species, Diameter, Volume, Wetwt, DryWt, and Decay are the explanatory variables; $\beta_0$ is the y-intercept (constant term); $\beta_j$ are the slope coefficients for each explanatory variable ($j$ indicates the attribute index); and $\epsilon_i$ is the model’s error term (also known as the residuals).
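The regression above can be fitted as sketched below with statsmodels, assuming a CSV file and column names that mirror Table 3; the file name, the target column name, and the one-hot encoding of the Species attribute are assumptions.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names; the actual dataset schema may differ
cols = ["Species", "Diameter", "Volume", "Wetwt", "DryWt", "Decay"]
df = pd.read_csv("forest_logs.csv")                  # assumed file name

# Encode the categorical Species column and add the intercept term
X = sm.add_constant(
    pd.get_dummies(df[cols], columns=["Species"], drop_first=True).astype(float))
y = df["WoodDensity"]                                # assumed target column

ols = sm.OLS(y, X).fit()
print(ols.params)       # beta_0 (const) and the slope coefficients beta_j
print(ols.summary())    # p-values used by the backward-elimination step
```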
3.2.2. Checking the Skewness of the Data
Classifiers are built up in machine learning to eliminate misclassification errors and, as a result, optimize predictive accuracy. The class imbalance problem, which refers to an uneven distribution of response variable values, is one of the most prevalent issues that influence raw data.
An unbalanced dataset is one in which the number of samples in different classes is highly uneven, making classification difficult. Modern machine learning techniques struggle with uneven data because they focus on reducing the overall error rate, which favors the dominant class while disregarding the underrepresented one. Classification becomes extremely difficult because the results may be skewed toward the dominant class values.
As per the experimental dataset, a tree may belong to any one of five decay levels ranging from 1 to 5. If a tree belongs to decay level 1, it is not yet decayed; otherwise, it has a decaying component. Since our aim is to classify trees, we considered only two classes, namely “Not yet Decayed” trees and “Decayed” trees. The dataset is considered for the experimental study of the class imbalance problem. As mentioned earlier, there are possibilities of overfitting or underfitting.
The class details are given below.
Class 0: 9132 (Not yet Decayed)
Class 1: 2255 (Decayed Trees)
The class imbalance problem in the experimental dataset is depicted in
Figure 2.
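Using the class counts reported above, the degree of imbalance can be quantified as in the small illustrative computation below; this is not part of the original pipeline, only a summary of the reported numbers.

```python
import pandas as pd

# Class counts reported for the experimental dataset
counts = pd.Series({"Not yet Decayed (class 0)": 9132, "Decayed (class 1)": 2255})

imbalance_ratio = counts.max() / counts.min()
minority_share = counts.min() / counts.sum()

print(counts)
print(f"imbalance ratio : {imbalance_ratio:.2f} : 1")   # ~4.05 : 1
print(f"minority share  : {minority_share:.1%}")        # ~19.8% of all trees
```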
3.3. Stratified Sampling-Based Deep Neural Network (SSDNN) Approach
The process of classifying unbalanced datasets involves three main steps: feature selection, fitting the data distribution, and model training. Feature selection helps to identify the most suitable subsets of features for classifying unbalanced data while considering the class imbalance among the features. Various resampling approaches that minimize the impact of class inequality on the classifier can be used to fit the data distribution.
The most common resampling strategies are oversampling and undersampling. These strategies aim to balance the datasets by increasing the sample points in the minority class or decreasing the sample points in the majority class, respectively. However, oversampling algorithms may generate duplicate information and increase the training time of the classifier, while undersampling may result in the loss of majority class information.
Both random oversampling (ROS) and random undersampling (RUS) disregard the underlying data distribution, so the resulting samples may not represent it well. SMOTE has drawbacks such as oversampling noisy samples and generating uninformative data, and it is highly challenging to determine the closest neighbors for the synthesized samples. Also, the SMOTE samples always lie within the region spanned by the existing minority samples, and pruning them leads to an increase in the misclassification rate.
We propose a stratified random sampling method to resolve this issue, which performs the task of input selection for the DNN. According to sampling theory, stratified random sampling involves dividing a population into smaller groups without duplicating or omitting records. The proposed method increases the computational efficiency of the reliability evaluation of the model.
The stratified sampling approach divides the data into blocks based on specified values to extract the structural facts of the data and then draws samples at random from these distinct data blocks. Stratification makes it simple to find representative samples. In the case of forest datasets, stratified sampling can be applied to guarantee that the number of samples for each class is balanced and that the variance of the data within each class is considered when choosing the optimum number of samples. This helps to preserve the original data structure feature information while also ensuring a balance in the number of samples for the majority and minority classes. The specific procedure is to randomly select some examples from both positive and negative occurrences and then combine the training samples for classification. Stratified sampling is best suited for the uneven distribution of data, and it is applied to different domains [25,27,28,29].
The diverse dataset N is split into homogeneous groups S0, S1, and so forth, up to Sn; these groups, known as strata, are then sampled using uniform random or systematic sampling within each stratum. The reduction in estimation error is the primary advantage of stratified sampling over other sampling techniques. Relatively homogeneous data objects are grouped together based on the required parameters, and a sample for data analysis is then drawn within each stratum via random sampling.
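A minimal pandas sketch of this per-class stratification is shown below; the helper name, label column, and per-class sample size are illustrative assumptions rather than the authors’ exact implementation.

```python
import pandas as pd

def stratified_balance(df: pd.DataFrame, label_col: str, n_per_class: int,
                       random_state: int = 42) -> pd.DataFrame:
    """Split the data into per-class strata and draw the same number of samples
    from each stratum (sampling with replacement only when a stratum is
    smaller than the requested size, i.e., oversampling the deficit class)."""
    strata = []
    for _, stratum in df.groupby(label_col):
        replace = len(stratum) < n_per_class
        strata.append(stratum.sample(n=n_per_class, replace=replace,
                                     random_state=random_state))
    # Concatenate the strata and shuffle the combined training samples
    return pd.concat(strata).sample(frac=1.0, random_state=random_state)

# Hypothetical usage: balance the two decay classes before training
# balanced_train = stratified_balance(train_df, label_col="DecayClass", n_per_class=2255)
```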
The Stratified sampling-based deep neural networks approach is shown below in
Figure 3.
Deep learning is a feed-forward neural network with one or more hidden layers. Deep learning is a subfield of machine learning that emphasizes the use of numerous linked layers to transform input into features and predict associated outputs; artificial neural networks are at its core. Deep neural networks (DNNs) contain an input layer, an output layer, and numerous hidden layers; the hidden layers lie between the input layer and the output layer. Training a deep neural network involves the following steps: first, the structure of the DNN is set and its weights are initialized according to requirements; second, the input is propagated forward through the layers to obtain the output and the error; and finally, the error is propagated backward through the layers to update the weights.
DNNs can handle both linear and nonlinear problems by propagating activations layer by layer with an appropriate activation function. In essence, DNNs are fully connected neural networks; a deep neural network of this kind is also known as a multi-layer perceptron (MLP). The hidden layers transform the input feature vectors, which eventually arrive at the output layer, where the binary classification result is obtained.
Environmental researchers have been interested in determining the functional links between carbon storage and the uncertainty of plant wood density, for which an appropriate modeling technique is required. Developing empirical models to forecast the decay class of the tree is the focus of this research. A deep neural network predicts the decay class of the tree more accurately than standard models. Because the DNN imposes no constraints on model construction, its predictions are more accurate than those of the ensemble model. The loss computed on the training data, as shown for the model topology, indicates that there was no overfitting.
The learning model of the suggested work has four layers: one input layer, two hidden layers, and one output layer, as shown in Figure 4. The ReLU activation function was utilized in the hidden layers, and the sigmoid function was used at the output layer. The binary cross-entropy loss between the predicted and true labels was used as the objective function to be minimized in the network. The Adam optimizer was chosen over other existing optimization techniques because of its efficiency. To create a model, each dataset was first randomly divided into two parts: a 75% training set and a 25% test set.
The training set is examined for skewness and, if necessary, balanced using a stratified sampling procedure. The balanced training set is then used to develop and train the DNN models, while the test sets are utilized to evaluate the performance of the predictive models. We used the following simple method to choose the best threshold: the curve of balanced accuracy as a function of the prediction threshold is first plotted, and the best threshold is the one at which the DNN achieves the highest balanced accuracy. The imbalanced-learn library for Python was then used to apply each data-balancing technique to each training batch. The model was tried with mini-batch sizes of 10, 25, 50, and 100, with 100 determined to be the best choice, and with epoch counts of 10, 25, and 50.
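A Keras sketch of the described network (two ReLU hidden layers, sigmoid output, binary cross-entropy, Adam) and of the balanced-accuracy threshold scan is given below; the layer widths follow Table 4, and the training/testing variables in the commented usage are assumed to exist.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import balanced_accuracy_score

def build_ssdnn(n_features: int) -> keras.Model:
    """Two ReLU hidden layers and a sigmoid output, trained with
    binary cross-entropy and the Adam optimizer."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(400, activation="relu"),
        layers.Dense(400, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def best_threshold(y_true, y_prob):
    """Scan candidate cut-offs and keep the one with the highest balanced accuracy."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [balanced_accuracy_score(y_true, (y_prob >= t).astype(int))
              for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Hypothetical usage (X_train, y_train, X_test, y_test assumed to exist):
# model = build_ssdnn(n_features=X_train.shape[1])
# model.fit(X_train, y_train, epochs=50, batch_size=100,
#           validation_data=(X_test, y_test))
# y_prob = model.predict(X_test).ravel()
# print("best threshold:", best_threshold(y_test, y_prob))
```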
3.3.1. Algorithm for SSDNN Model
The algorithm for the proposed SSDNN model is given in Algorithm 1. The proposed SSDNN model first extracts the features required for the task and verifies whether the class ratio of the dataset is unbalanced. Next, it chooses the right samples for prediction. A representation of the suggested model algorithm is given below.
Algorithm 1. Proposed Algorithm for SSDNN Model
1. Import the dataset
2. Perform the wrapper method (backward elimination)
3. Check the skewness of the dataset
4. Apply stratified sampling
5. Update the imbalanced dataset
6. Load the training dataset
7. Train the DNN
8. Shuffle and split as 75% and 25%
9. Use the SVM kernel for classification
10. Tune the parameters
11. Apply to the test dataset
12. End
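The steps of Algorithm 1 can be wired together roughly as follows. This is only a sketch that reuses the helper functions sketched earlier in this section; the file path, column names, per-class sample size, and the assumption that categorical attributes are already numerically encoded are all illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("forest_logs.csv")                                  # step 1 (assumed path)

candidates = [c for c in df.columns if c not in ("WoodDensity", "DecayClass")]
features = backward_elimination(df[candidates], df["WoodDensity"])   # step 2: wrapper method

print(df["DecayClass"].value_counts(normalize=True))                 # step 3: skewness check
balanced = stratified_balance(df[features + ["DecayClass"]],         # steps 4-5: stratify and
                              label_col="DecayClass",                #   update the dataset
                              n_per_class=2255)

X_train, X_test, y_train, y_test = train_test_split(                 # step 8: shuffle and split
    balanced[features].to_numpy(dtype="float32"),
    balanced["DecayClass"].to_numpy(dtype="float32"),
    train_size=0.75, shuffle=True,
    stratify=balanced["DecayClass"], random_state=42)

model = build_ssdnn(n_features=len(features))                        # steps 6-7: train the DNN
model.fit(X_train, y_train, epochs=50, batch_size=100)               # step 10: tuned parameters
print(model.evaluate(X_test, y_test))                                # step 11: apply to test set
```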
3.3.2. DNN with Hyperparameter Tuning
Hyperparameter tuning of the deep neural network employs a random search to identify the ideal combination from a set of candidate hyperparameter values. The random search evaluated a set of 20 hyperparameter combinations. The best hyperparameters found via the random search are as follows.
Finally, the model is tuned using a random search approach, where the optimum parameters are 2 hidden layers, 400 neurons, the ReLU activation function, 50 epochs, and a batch size of 100, as listed in
Table 4.
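A possible realization of the random search with the Keras Tuner library is sketched below; the search space mirrors Table 4, but the exact candidate ranges and the input dimension are assumptions, and epochs and batch size can be swept separately in the search() call.

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    """Search space loosely mirroring Table 4: number of hidden layers and neurons."""
    model = keras.Sequential([keras.Input(shape=(6,))])          # 6 selected features (assumed)
    for _ in range(hp.Int("hidden_layers", 1, 3)):
        model.add(layers.Dense(hp.Choice("units", [100, 200, 400]), activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20,
                        overwrite=True, directory="tuning", project_name="ssdnn")
# tuner.search(X_train, y_train, epochs=50, batch_size=100, validation_split=0.2)
# best_hp = tuner.get_best_hyperparameters(1)[0]
```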
As the number of epochs increases, the accuracy of the proposed method also increases, and we obtain maximum accuracy when the number of epochs approaches 100. The built model is compared with the existing models, and the performance is analyzed in the results and discussion section.
4. Experiment Results and Discussion
To recognize dead and live trees, we used the forest tree dataset to perform our classification. The dataset was preprocessed to determine the relevance of the variables for categorization. The dataset was split into two parts: training and testing. We used the training dataset to train the DNN and the test set to evaluate classifier performance. We conducted a large number of trials to discover the ideal DNN design and parameters, using various combinations of batch sizes, numbers of hidden units, and learning rates.
Because of the imbalanced dataset, the DNN accuracy is good, but other performance metrics such as the F1 score, precision, and recall are low. As a result, the dataset is balanced via stratified sampling, and the resulting strata are supplied to the DNN as a training set. The result of the proposed model is compared with the previous models SVM, Naïve Bayes, and Random Forest. Earlier, we tried to perform the classification using these three models with different datasets. Each model has its own merits and pitfalls. For smaller datasets, SVM produces better results but is not promising for larger ones. Random Forest is one of the best choices for larger datasets but is time consuming. Naïve Bayes is simple and assumes that the variables are independent, so it is not preferred for large datasets.
It is evident from the results that the proposed model gives high accuracy in addition to performing well on large datasets. The proposed approach is written in Python in a Jupyter Notebook and uses the Keras package on a 64-bit OS with an x64 CPU, and the model worked well on the Google Colab platform. Thus, by combining the DNN with stratified sampling, the prediction and classification of dead trees in the forest are successfully completed. Forest managers will be able to predict the early stages of decaying trees with this information. The proposed method can also be applied to similar datasets belonging to different domains.
4.1. Performance Metrics
The efficiency of the proposed method is analyzed using classification accuracy, precision, recall, and F1 score. The performance of the three approaches, namely SVM, Random Forest, and Naïve Bayes, with different sampling techniques is depicted in the following figures.
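The four metrics can be computed with scikit-learn as in the following sketch; the prediction threshold step reuses the earlier sketch, and the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_true, y_pred) -> dict:
    """Compute the four metrics used to compare the samplers and classifiers."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }

# Hypothetical usage with the trained model and held-out test set:
# y_prob = model.predict(X_test).ravel()
# y_pred = (y_prob >= best_threshold(y_test, y_prob)).astype(int)
# print(report(y_test, y_pred))
```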
4.2. Results and Discussion
Table 5 shows the comparison of test accuracy among the proposed DNN models with sampling methods. The performance in terms of accuracy of the existing and proposed algorithms along with different sampling techniques is shown in
Figure 5.
The performance of the proposed SSDNN method with different existing sampling techniques is shown in
Figure 6.
The DNN, DNN + oversampling, DNN + undersampling, DNN + SMOTE, and DNN + stratified sampling yield test accuracies of 80%, 76%, 69%, 78%, and 91%, respectively. First, the DNN model was created and tested on the prepared dataset, yielding low accuracy. The DNN model was analyzed to determine the reason for the low accuracy, and it was found that the dataset was unbalanced. The imbalanced dataset was subsequently handled using a stratified sampling technique, which divided the training dataset into distinct strata for each class. The data from each stratum were distributed uniformly to the deep neural network, resulting in good accuracy, precision, recall, and F1 score. Several tests using the tree dataset were carried out to determine the optimal deep neural network.
The training and testing accuracy and loss of the proposed SSDNN are visualized in Figure 6. From the figure, during the initial epochs, the accuracy is low while the loss is high, but in the subsequent epochs, the results are more promising. The same parameters are analyzed for the testing phase, which shows the same trends in model accuracy and model loss. To observe the variations more clearly, the chart is plotted up to 25 epochs.
Also, the training/testing accuracy and loss of the proposed method are shown in
Figure 7. The proposed DNN + stratified sampling results in an accuracy of 91% with higher efficiency. The proposed model was compared to the ensemble SVM kernel algorithm used in prior work, and the results show that the proposed DNN + stratified sampling model is more efficient. The proposed method is more robust than the traditional methods due to hyperparameter tuning, a low false positive rate, and high recall.
5. Conclusions
In this research, we experimented to find the best model for classifying a forest tree as dead or alive. For predicting the decay class of a tree, the classification models DNN, DNN + oversampling, DNN + undersampling, DNN + SMOTE, and DNN + stratified sampling were applied to the dataset. The results show that DNN + stratified sampling offers better performance with high accuracy.
The proposed method classifies a tree as either dead or alive more correctly than the other models and is suitable for handling any imbalanced dataset for classification. In deep learning, classification accuracy often increases when the amount of training data increases; thus, using a larger dataset for training is a good research direction for continuing to improve the accuracy of forest tree classification. This paper suggests that identifying decaying trees earlier will help forest managers remove them before they begin to emit carbon back into the atmosphere.
This research promotes reforestation by planting a new tree after removing a dead tree to reduce pollution and forest fires. In the case of stratified sampling, the research gap discovered is that the number of records in the two classes is not equal; hence, a deficit of records occurs when training the model. To address this issue, the deficit class is oversampled, the strata are shuffled, and the model is trained to increase its efficiency. In future work, the proposed method can be applied to smart forest management. Since there may be uneven or irrelevant data during data collection, IoT-based RFID tags can be attached to each tree to automate data collection and to indicate its level of decay and carbon absorption.