1. Introduction
For many years, the cone penetration test (CPT) has been the predominant method for conducting field exploration in geotechnical engineering [1,2,3,4,5]. This test requires a cone-shaped instrument to be inserted into the soil at a consistent penetration rate, while measuring the cone tip resistance (qc) and sleeve friction (fs). The CPT continuously provides precise, repeatable results for its entire profile depth. Moreover, the CPT is a relatively quick and inexpensive means of acquiring field data for estimating parameters for many applications, such as soil classification, environmental studies, hydrological analysis, and seismic site response assessments.
Soil classification is essential in geotechnical engineering, especially when evaluating site response to seismic events. Accurate soil classification helps to understand the dynamic properties of soil and the effects of earthquakes on the soil’s behavior. Traditional soil classification based on CPT data involves analyzing 2D charts. Early research aimed to predict the distribution of soil particles from CPT measurements, as outlined in the pioneering work of Begemann [6]. However, later work by Douglas and Olsen [7] suggested that a more useful soil classification approach for practical engineering projects would consider soil behavior, rather than relying solely on soil particle distribution. As a result, Robertson developed soil classification charts based on a soil behavior type index using CPT measurements [4,8]. Alternative methods for soil classification also exist. The Unified Soil Classification System (USCS), for example, relies on extensive field and laboratory tests to classify soil [9].
Soil classification and parameter estimation using traditional methods can be costly and time-consuming. Field and laboratory testing is required, and soil samples need to be transported to a laboratory where particle size distribution and Atterberg limit tests are conducted. These tests take time to complete, and the results may not be immediately available. Additionally, soil properties can change significantly with variations in temperature and moisture content. In recent years, however, machine learning (ML) techniques have shown great promise in soil classification. Many studies have demonstrated the potential of ML techniques in soil classification based on CPT measurements [10,11,12,13,14,15].
The study conducted by [15] explored the feasibility of using a general regression neural network (GRNN) to predict soil composition and overall soil type from CPT data. The research demonstrated that the GRNN model successfully categorized soils as either coarse-grained or fine-grained. Similarly, studies have demonstrated the effectiveness of artificial neural network (ANN) models in predicting complex soil profiles [16,17,18]. In addition, various machine learning techniques such as random forests (RFs) [19], support vector machines (SVMs) [20,21], decision trees (DTs) [22], gradient boosting machines (GBMs) [23,24,25,26], and logistic regression (LR) [27] have been utilized for a variety of geotechnical engineering applications, including classification and liquefaction assessment.
ML techniques have been widely used in various fields, including image [28,29,30] and speech recognition, natural language processing, and data analysis, to extract insights from large datasets. For instance, ML algorithms are used to identify objects, faces, and patterns in images, which is crucial in facial recognition, autonomous driving, and object detection in security systems. ML algorithms are also used to transcribe speech into text, enabling the creation of voice assistants, language translation software, and speech-to-text dictation tools. In addition, ML algorithms are used to analyze large datasets, uncovering patterns and insights that would be difficult or impossible to identify manually. This technique has numerous applications in finance, healthcare, and scientific research.
Although ML techniques have been widely applied in various fields, research on their use in geotechnical engineering remains comparatively limited. However, researchers have started exploring the potential of ML techniques for soil classification and estimation of soil parameters using CPT data. Some geotechnical researchers have applied ML techniques to predict landslides [31], slope stability [32,33,34], soil type [12], and shear wave velocity [13], in several cases utilizing CPT data.
In our study, we aim to evaluate the performance of four commonly used ML algorithms, namely artificial neural network (ANN), random forest (RF), support vector machine (SVM), and decision tree (DT), for soil classification using CPT data. This study has the potential to address the gap in the existing literature and offer valuable insights into the efficacy of ML algorithms for soil classification through CPT data. Furthermore, the findings of this study could help improve the efficiency and accuracy of soil classification in geotechnical engineering practice.
The selection of the ML model for a classification task is based on several factors, including desired accuracy, dataset size, generalization ability, interpretability, and robustness. For our specific soil classification problem, we chose to evaluate the performance of the ANN, DT, SVM, and RF algorithms, each with its own strengths and weaknesses. ANNs are known for their ability to capture complex non-linear relationships in the data [35], while RF is an ensemble algorithm that utilizes multiple decision trees to enhance the accuracy and robustness of the model [36,37]. SVM can handle high-dimensional data and nonlinear decision boundaries [38,39], and DT is easy to interpret and visualize and can handle both categorical and numerical data. By selecting these four algorithms, we aimed to strike a balance between complexity and interpretability and to compare the performance of the models. Our choice of algorithms provides a diverse set of models that can handle various aspects of the classification task, including complex relationships, high-dimensional data, and interpretability. By evaluating their performances, we hope to gain insights into which algorithm is the most suitable for our specific soil classification problem.
The performance of the ML models can be compromised by various factors if not properly addressed. One of the critical factors that can affect the models’ performance is the selection of hyperparameters. By carefully selecting and tuning the hyperparameters, we can improve the models’ robustness and ensure that they perform well in real-world applications [40]. Grid search, a technique that evaluates every combination drawn from predefined sets of hyperparameter values, is one of the most commonly used methods for hyperparameter tuning [41]. Bayesian optimization is another approach that uses a probabilistic model to estimate the performance of different hyperparameter configurations [41,42].
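As a rough illustration of grid search, the Python sketch below evaluates every combination in a small predefined grid and keeps the one with the lowest score. The objective `validation_error` and the grid values for `cost` and `gamma` are hypothetical stand-ins chosen for the example; they are not the settings or the software used in this study.

```python
from itertools import product

def grid_search(objective, grid):
    """Return (best_params, best_score) minimizing objective over the full grid."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    # Try every combination of the predefined candidate values.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical validation error with its minimum at cost=10, gamma=0.1.
def validation_error(cost, gamma):
    return (cost - 10) ** 2 / 100 + (gamma - 0.1) ** 2

grid = {"cost": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(validation_error, grid)
```

In practice the objective would be a cross-validated error of the fitted model rather than a closed-form function, and the cost of the search grows with the product of the grid sizes.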
ML models rely on input features to make predictions, and the quality and relevance of those features can have a significant impact on the performance of the models [43]. Feature importance analysis is the process of determining which features in a dataset matter most for a given model. Performing such an analysis, for example with permutation feature importance [44], and eliminating irrelevant features from the dataset can significantly enhance the performance of ML models.
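The idea behind permutation feature importance can be sketched in a few lines of Python: shuffle one feature column at a time and record how much the model's accuracy drops. The toy model and the random data below are made up for illustration and are unrelated to the CPT dataset.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model (hypothetical): the prediction depends on feature 0 only.
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

Here shuffling feature 1 leaves the accuracy unchanged, so its importance is zero, whereas shuffling feature 0 degrades the predictions noticeably.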
Outliers are another factor that contributes to the poor performance of ML models. According to the literature [45,46], outliers are data points that deviate significantly from the surrounding data points. Abnormal data readings during CPT operations occur primarily due to human or procedural errors, such as the addition of a rod [47]. These outliers are not representative of the actual CPT measurements and should be detected and treated during the data preprocessing stage.
This paper comprises six sections. The first section provides a detailed discussion on the background of soil classification and ML models. In Section 2, the cone penetration test is explained, while Section 3 outlines the dataset preprocessing and methodology utilized. The ML models employed in this study are briefly summarized in Section 4. In Section 5, a detailed discussion is presented on the results obtained from the ML models. Finally, Section 6 provides a summary of the main points of the study and concludes the paper by proposing recommendations for future research.
2. Cone Penetration Tests
The CPT is a widely used in situ geotechnical testing method that involves inserting a cone-shaped penetrometer into the soil and recording the soil’s resistance (i.e., qc and fs) to penetration. Figure 1 plots the recorded qc, fs, and friction ratio profiles used in this study.
The CPT and its variations, such as the CPT with pore pressure measurement (CPTu) and the seismic cone penetration test with pore pressure measurements (SCPTu), are valuable tools for various engineering applications. These tests can estimate geotechnical parameters and classify soils over a broad range of soil types, from very soft soil to weak rock. Over the past few decades, various soil behavior charts have evolved for soil classification based on CPT-measured data [1,2,3,4,48].
One such chart was developed by Robertson [8] and can be used to classify soils into different categories, such as sand, clay, silt mixture, organic soil, and more. An example of such a chart is shown in Figure 2, which illustrates the classification of soil types ranging from sensitive clays to very stiff over-consolidated (OC) clays. The chart categorizes soils into various classes or zones based on their soil behavioral type index (Ic) determined by Equation (1) [48]. Table 1 lists boundaries for classification based on Ic values. In this study, the zone numbers (see Figure 2) are directly used as ML labels as they represent the soil types in a straightforward and intuitive way.
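$$I_c = \left[\left(3.47 - \log\frac{q_t}{p_a}\right)^2 + \left(\log R_f + 1.22\right)^2\right]^{0.5} \qquad (1)$$
where $q_t$ is corrected cone resistance or CPT cone resistance $q_c$, $p_a$ is atmospheric pressure in the same unit as $q_t$, $R_f$ is friction ratio (in percent), and $f_s$ is CPT sleeve friction.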
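A minimal Python sketch of how the index and the zone labels could be computed is given below. It assumes the non-normalized form of Robertson's index, Ic = [(3.47 - log10(qt/pa))^2 + (log10(Rf) + 1.22)^2]^0.5 with Rf = 100 fs/qt, and the commonly published Ic boundaries between zones 2 to 7; these boundaries are assumed to correspond to Table 1, which is not reproduced here. The function names, the atmospheric pressure value, and the example reading are illustrative only.

```python
import math

P_A_KPA = 101.325  # assumed atmospheric pressure in kPa

def soil_behavior_index(qt_kpa, fs_kpa, pa_kpa=P_A_KPA):
    """Non-normalized soil behavior type index from cone resistance and sleeve friction."""
    rf = fs_kpa / qt_kpa * 100.0  # friction ratio in percent
    return math.sqrt((3.47 - math.log10(qt_kpa / pa_kpa)) ** 2
                     + (math.log10(rf) + 1.22) ** 2)

# (upper Ic bound, zone) pairs; commonly published boundaries, assumed to match Table 1.
ZONE_BOUNDS = [(1.31, 7), (2.05, 6), (2.60, 5), (2.95, 4), (3.60, 3)]

def soil_zone(ic):
    """Map an Ic value to a soil behavior type zone number (2 to 7)."""
    for upper, zone in ZONE_BOUNDS:
        if ic < upper:
            return zone
    return 2  # Ic above 3.60

# Made-up reading: qt = 5000 kPa, fs = 50 kPa (friction ratio of 1%).
ic = soil_behavior_index(5000.0, 50.0)
zone = soil_zone(ic)
```

Lower Ic values map to coarser, sand-like behavior (zones 6 and 7), and higher values to clay-like and organic soils (zones 3 and 2).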
Although existing empirical correlations work well with the CPT data, their applicability is limited to primarily fine-grained soils. Additionally, CPT and core drilling techniques work together to provide more detailed information about subsurface soil properties [12].
3. Datasets
For our study, we used publicly available CPT datasets contributed by [47], which are accessible in the International Society for Soil Mechanics and Geotechnical Engineering (ISSMGE) database. The CPTs were collected from an area measuring 50 by 50 m. Each CPT was performed to a depth of 5 m below the ground surface, and the measurement spacing of qc and fs was 5 mm. Further information about the specifics of the CPTs can be found in [47,49,50].
We preprocessed the datasets using MS Excel (see Figure 3) to categorize the soil behavior types based on Robertson’s classification [48]. To reduce bias and ensure that the training and testing datasets are representative of the overall dataset, we shuffled the dataset and divided it into training and testing datasets using an 80:20 ratio.
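The shuffle-and-split step can be sketched as follows in Python. The function name, the fixed seed, and the placeholder rows are assumptions for illustration; the actual split in this study was not necessarily performed this way.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(1000))  # placeholder for the preprocessed CPT records
train, test = train_test_split(rows)
```

Because the split is purely random, the minority soil types are only approximately represented in the same proportion in both subsets.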
The steps followed to preprocess data are as follows:
(1) Combine the individual CPT soundings into the appropriate columns (e.g., depth, qc, and fs) using Power Query in MS Excel and remove the missing values.
(2) Calculate the interquartile range (IQR) values for the qc and fs columns using Excel’s built-in functions such as QUARTILE.
(3) Determine the upper threshold values for outlier detection by multiplying the IQR by three and adding the third quartile.
(4) Identify the outlier values in the qc and fs columns using conditional formatting.
(5) Replace each outlier value in the dataset with the threshold value.
(6) Estimate the friction ratio $R_f$, total vertical stress $\sigma_v$, effective vertical stress $\sigma'_v$, and Ic.
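Steps (2) to (5) can be approximated outside Excel with the short Python sketch below. It assumes that the "inclusive" quantile method of Python's `statistics` module reproduces the interpolation of Excel's QUARTILE function; the sample values are made up.

```python
from statistics import quantiles

def cap_upper_outliers(values, k=3.0):
    """Cap values above Q3 + k * IQR at that threshold; return (capped, threshold)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    return [min(v, threshold) for v in values], threshold

# Made-up qc readings with one abnormal spike.
qc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
capped, threshold = cap_upper_outliers(qc)
```

With a multiplier of three, only extreme readings are capped, so moderately high but plausible resistances are left untouched.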
Table 2 presents a statistical summary of the datasets organized into 222,100 rows and 7 columns. The frequency distribution of each soil type in the dataset is shown in Figure 4. The distribution analysis shows that soil type 5 has the highest frequency and represents over 50% of the total dataset. Soil type 4 has the second highest frequency and represents over 30% of the dataset. Soil types 2, 3, 6, and 7 (minority classes) have much lower frequencies and together represent less than 20% of the dataset, indicating an imbalanced dataset. Balancing this highly imbalanced dataset using oversampling or undersampling techniques may be possible, but it can also distort the natural variability of the soil, potentially leading to biased predictions and incorrect soil classification. To avoid this, we opted to train the ML models on the imbalanced datasets and evaluate their performances using appropriate evaluation metrics such as sensitivity, precision, and F1_score, instead of artificially generating or discarding soil samples.
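The input features, which include depth, qc, and fs, are raw data directly obtained from the test. In contrast, the friction ratio $R_f$, total vertical stress $\sigma_v$, and effective vertical stress $\sigma'_v$ are derived quantities (Equations (2), (3) and (5), respectively).
$$R_f = \frac{f_s}{q_c} \times 100\% \qquad (2)$$
$$\sigma_v = \gamma h \qquad (3)$$
where $\sigma_v$ is total vertical stress, $\gamma$ is unit weight of soil, and $h$ is depth of soil.
The unit weight of the soil is estimated using the following expression [51]:
$$\frac{\gamma}{\gamma_w} = 0.27 \log R_f + 0.36 \log\left(\frac{q_c}{p_a}\right) + 1.236 \qquad (4)$$
where $\gamma$ is unit weight of soil, $\gamma_w$ is unit weight of water in the same unit as $\gamma$, $q_c$ is cone tip resistance, and $p_a$ is atmospheric pressure in the same unit as $q_c$.
$$\sigma'_v = \sigma_v - \gamma_w h \qquad (5)$$
where $\sigma'_v$ is effective vertical stress, $\sigma_v$ is total vertical stress, $\gamma_w$ is unit weight of water in the same unit as $\gamma$, and $h$ is depth of soil.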
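The derived features can be sketched in Python as follows. The sketch assumes the friction ratio Rf = 100 fs/qc, the unit weight correlation gamma/gamma_w = 0.27 log10(Rf) + 0.36 log10(qc/pa) + 1.236, total stress gamma times depth, and a pore water pressure of gamma_w times depth, as the symbol definitions above suggest. The constants for the unit weight of water and atmospheric pressure, the function name, and the example reading are assumptions.

```python
import math

GAMMA_W = 9.81   # unit weight of water in kN/m^3 (assumed)
P_A = 101.325    # atmospheric pressure in kPa (assumed)

def derived_features(depth_m, qc_kpa, fs_kpa):
    """Return (Rf in %, unit weight in kN/m^3, total stress and effective stress in kPa)."""
    rf = fs_kpa / qc_kpa * 100.0
    gamma = GAMMA_W * (0.27 * math.log10(rf)
                       + 0.36 * math.log10(qc_kpa / P_A)
                       + 1.236)
    sigma_v = gamma * depth_m                    # total vertical stress
    sigma_v_eff = sigma_v - GAMMA_W * depth_m    # effective vertical stress
    return rf, gamma, sigma_v, sigma_v_eff

# Made-up reading at 2 m depth: qc = 5000 kPa, fs = 50 kPa.
rf, gamma, sigma_v, sigma_v_eff = derived_features(2.0, 5000.0, 50.0)
```

This per-reading form applies a single unit weight over the full depth; a layered profile would instead accumulate the stress increment of each depth interval.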
4. Machine Learning Models
ML is a subfield of artificial intelligence (AI) that aims to develop algorithms and statistical models to help computer systems improve their performance on specific tasks by learning from the data [52]. The types of learning include supervised, unsupervised, and reinforcement learning [53]. While the supervised and reinforcement learning algorithms can involve human supervision, the unsupervised learning algorithms do not rely on labeled data or human guidance.
Our study utilized supervised ML algorithms to classify soils using the CPT datasets. We trained four different ML algorithms, ANN, RF, SVM, and DT, using training CPT datasets and tested their performance on test datasets via the R programming language [54]. In the following sections, we discuss each of the ML algorithms to gain insight into their strengths and limitations.
4.1. Artificial Neural Network Model
ANNs are ML models that draw inspiration from the human brain’s structure and functions [35]. They comprise interconnected neurons that use weighted connections to process and transmit information. ANNs can learn data patterns and relationships by modifying the connection strength based on the input and output. ANN models typically contain three types of layers: input, hidden, and output layers. Figure 5 presents an example of an ANN model that includes an input layer with 6 neurons, 2 hidden layers with 16 and 8 neurons, and an output layer. The term deep learning is commonly used to describe neural networks with many hidden layers.
In the ANN models, weights (the connection strength between neurons) and activations (output of a neuron in the network) are fundamental elements that enable the network to learn patterns and relationships in the data. The weights in an ANN are adjusted during the learning process to optimize the model’s performance. At the same time, activation applies a mathematical operation to the input and transmits an output to the other neurons in the network.
Choosing the proper activation function is essential when dealing with an ANN model. There are several types of activation functions, namely, Sigmoid function (commonly used for binary classifications), ReLU (rectified linear unit) function, Tanh (hyperbolic tangent) function, and Softmax function (commonly used in the output layer).
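For reference, the four activation functions named above can be written compactly in Python. This is a generic sketch, not code from this study; the softmax subtracts the maximum logit before exponentiating, a common numerical stability measure.

```python
import math

def sigmoid(x):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, x)

def tanh(x):
    """Squash a real value into (-1, 1)."""
    return math.tanh(x)

def softmax(logits):
    """Turn a list of scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

In a multi-class classifier, the softmax output can be read as one probability per soil type, and the predicted class is the one with the largest probability.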
Our study considers an ANN model with 2 hidden layers containing 128 and 32 neurons and an output layer. We implemented our models using the Keras package [55] in R, which provides an easy-to-use interface for building and training neural networks. The multi-layer perceptron (MLP) model uses the ReLU activation function in both hidden layers and the Softmax activation function in the output layer. The model was compiled using the categorical cross-entropy loss function, the Adam optimizer, and accuracy as the evaluation metric. It was trained on the training data for 200 epochs, using a batch size of 32 and a validation split of 0.2. During training, the model adjusts its weights and biases to minimize the loss, which measures the difference between the predicted and actual values.
The categorical cross-entropy loss function quantifies the mismatch between the predicted class probabilities and the actual class labels in a classification task. To minimize this loss, the Adam optimizer adjusts the weights and biases of the model during training. Accuracy, on the other hand, measures the percentage of correct predictions; when computed on held-out data, it indicates how well the model generalizes to new, unseen data. The performance of the model was improved by fine-tuning its hyperparameters with Bayesian optimization, including dense units 1, dense units 2, dropout 1, dropout 2, and batch size.
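To make the loss concrete, the following Python sketch computes the mean categorical cross-entropy for one-hot labels and predicted class probabilities. It is a generic illustration with made-up numbers rather than the Keras implementation; the small `eps` guard against log(0) is an assumption of this sketch.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy; y_true holds one-hot rows, y_pred holds probability rows."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Only the probability assigned to the true class contributes.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)

y_true = [[0, 1, 0], [1, 0, 0]]
confident = [[0.05, 0.90, 0.05], [0.80, 0.10, 0.10]]
uncertain = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
loss_confident = categorical_cross_entropy(y_true, confident)
loss_uncertain = categorical_cross_entropy(y_true, uncertain)
```

The loss is small when high probability is assigned to the correct class and grows as that probability shrinks, which is why minimizing it pushes the network toward confident, correct predictions.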
4.2. Random Forest Model
Random forest is a widely used ensemble learning algorithm for both classification and regression tasks. The algorithm employs multiple decision trees to improve the model’s accuracy and robustness. Unlike individual decision trees, random forest is less prone to overfitting as it combines multiple trees with varying biases and variances. Additionally, it can efficiently handle high-dimensional data with many features by randomly selecting a subset of features for each tree. As a result, the algorithm is capable of handling large and complex datasets [36,37,56].
In our research, we utilized the random forest algorithm to train a model using the random forest package [57] in the R programming language. We fine-tuned the model’s hyperparameters, including the number of variables randomly sampled at each split of a decision tree (mtry), the minimum node size (min.node.size), and the number of decision trees (ntree), using a model-based Bayesian optimization technique. The performance of the model was evaluated using cross-validation, and we selected the hyperparameter values that gave the best cross-validated performance.
4.3. Decision Tree Model
Decision tree (DT) is a widely used machine learning algorithm that can be applied to both classification and regression problems. It is a non-parametric algorithm that can handle large and complex datasets without imposing a rigid parametric structure, making it a versatile tool for various applications [57]. The DT algorithm builds a tree-like model where the internal nodes of the tree represent decisions based on input features, while each leaf node represents class labels or target values. DT models are particularly suitable for multi-class classification problems due to their ability to capture non-linear relationships between input features and target variables [58,59].
For our soil classification problem, we utilized the rpart package [60] in the R programming language to implement a decision tree model. We fine-tuned the model’s hyperparameters, including the complexity parameter (cp), the maximum depth of trees, the minimum split, and the maximum number of competitor splits, using Bayesian optimization. We used cross-validation to prevent overfitting and improve the model’s ability to generalize to new data.
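The cross-validation procedure mentioned above can be illustrated with a small Python sketch that partitions sample indices into k folds, each used once for validation. The number of folds is not stated in the text, so `k=5` is an assumed default, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Return a list of (train_indices, validation_indices) pairs for k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, validation))
        start += size
    return folds

folds = k_fold_indices(10, k=3)
```

A candidate hyperparameter setting is then scored by training on each `train` part, evaluating on the matching `validation` part, and averaging the k scores. The indices are assumed to refer to already shuffled data.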
4.4. Support Vector Machine Model
SVM is a well-known supervised ML algorithm frequently utilized for multi-class classification and regression problems [38,39]. The SVM algorithm operates by locating the optimal hyperplane that segregates the input data points into distinct classes. The hyperplane is chosen to maximize the margin, which is the gap between the hyperplane and the nearest data points of each class. For our study, we employed the e1071 R package [61], which offers an SVM implementation in R. This allowed us to train a model using the training data and assess its effectiveness on the test data. To ensure a well-tuned and generalized model, we used cross-validation to optimize the hyperparameters (cost and gamma).
5. Results and Discussion
In the following subsections, the results of the ML models are presented and discussed using confusion matrices and various performance metrics such as overall accuracy, sensitivity (ability to detect positive instances), specificity (ability to detect negative instances), negative predictive value (NPV), positive predictive value (PPV), and balanced accuracy (the average of sensitivity and specificity). Due to the imbalanced dataset used for the training and testing purposes, additional informative performance metrics such as precision, recall, and F1_score are utilized to assess the efficacy of the ML models.
These metrics are computed from four counts: TP = True Positive (number of samples correctly predicted as positive), TN = True Negative (number of samples correctly predicted as negative), FP = False Positive (number of samples incorrectly predicted as positive), and FN = False Negative (number of samples incorrectly predicted as negative).
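In terms of these four counts, the standard textbook definitions of the metrics named above can be written as follows (the equation layout and the absence of numbering are choices of this sketch):

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} = \text{Recall} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{PPV} = \text{Precision} &= \frac{TP}{TP + FP} \\
\text{NPV} &= \frac{TN}{TN + FN} \\
\text{Balanced accuracy} &= \frac{\text{Sensitivity} + \text{Specificity}}{2} \\
\text{F1\_score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```

For a multi-class problem, each soil type is treated in turn as the positive class and all other types as negative, which yields one set of metrics per class.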
5.1. Artificial Neural Network Model Results
The results of the ANN model used to classify different soil types are presented here. Figure 6 displays the accuracy and loss of the ANN model over 200 epochs on both the training and validation data. At the beginning of the training, the model has a relatively low accuracy of 0.79 and a high loss of 0.63, indicating that it cannot yet make good predictions. As training progresses, the accuracy improves and the loss decreases, indicating that the model gradually improves its ability to make accurate predictions. The concurrent improvement of the validation accuracy and loss suggests that the model is generalizing well to new data, which is a desirable outcome.
Table 3 displays the confusion matrix of the ANN model, which provides insight into the model’s performance on the test data. The rows correspond to the predicted values, while the columns correspond to the actual values. The diagonal elements of the confusion matrix represent the number of instances that the model correctly classified, while the off-diagonal elements correspond to the misclassifications made by the model.
Statistics by class (Table 4) show that the model has a high sensitivity for all soil types except type 7, with a low sensitivity value of 0.67. Additionally, the model has a high specificity for all classes, with values ranging from 0.99 to 1.0.
The positive predictive value (PPV) and negative predictive value (NPV) are important performance metrics in evaluating the effectiveness of a classifier. A high PPV indicates that when the model predicts a sample to belong to a particular class, the prediction is likely to be correct. Conversely, a high NPV indicates that when the model predicts a sample not to belong to a particular class, the prediction is likely to be correct. The ANN model results show a high PPV for all soil types, with values ranging from 0.91 to 0.99. Similarly, the NPV is high for all soil types, with values ranging from 0.99 to 1.
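The per-class statistics discussed here can be derived from a confusion matrix with the Python sketch below, which follows the convention that rows hold predicted classes and columns hold actual classes. The 3-class matrix and its labels are made-up numbers for illustration, not values from Table 3, and the function names are hypothetical.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] counts samples predicted as labels[i] whose actual class is labels[j]."""
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[k]) - tp                   # predicted as k, actually another class
        fn = sum(row[k] for row in cm) - tp    # actually k, predicted as another class
        tn = total - tp - fp - fn
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        ppv = tp / (tp + fp)
        npv = tn / (tn + fn)
        metrics[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "ppv": ppv,
            "npv": npv,
            "balanced_accuracy": (sensitivity + specificity) / 2,
            "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        }
    return metrics

def overall_accuracy(cm):
    """Share of samples on the diagonal of the confusion matrix."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

# Made-up matrix for three classes; the last class is a small minority.
cm = [[50, 2, 0],
      [3, 40, 1],
      [0, 1, 3]]
metrics = per_class_metrics(cm, labels=[4, 5, 7])
accuracy = overall_accuracy(cm)
```

In this toy example, the overall accuracy is high even though one in four samples of the minority class is missed, which illustrates why per-class sensitivity is reported alongside accuracy. The sketch does not guard against classes with no predicted or no actual samples, which would cause a division by zero.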
In summary, the ANN model has an overall accuracy of 98.82%, showing that the model performs well in classification tasks. However, the model struggles to predict class 7 (minority class), given a low sensitivity value.
5.2. Random Forest Model Results
Table 5 displays a confusion matrix that compares the soil types predicted by the RF model with the actual soil types. The confusion matrix reports the correct and incorrect predictions for each class. For example, the model correctly predicted 835 samples as class 2. The model has high diagonal values, signifying a high number of correct predictions, and low off-diagonal values, implying a low number of misclassifications. The model achieved an overall accuracy of 99.23%, indicating that it effectively predicts soil types.
The statistics by class (Table 6) show that the RF model has high sensitivity and specificity values for all soil types. Overall, it performed well in the classification task, achieving high scores for multiple performance metrics such as PPV, NPV, and balanced accuracy.
5.3. Decision Tree Model Results
Table 7 presents the confusion matrix for the DT model utilized for the soil classification task. The table shows the number of observations that were accurately predicted by the model (diagonal entries) and the number of misclassifications (off-diagonal entries). Based on the confusion matrix, the model performed well in the classification task, as the number of correctly predicted values is significantly higher than the number of misclassifications. The overall accuracy of the model in predicting soil types on the test dataset was 95.67%.
The statistics by class (Table 8) show that the DT model has a high sensitivity and specificity for all soil types. Moreover, the model exhibits a high balanced accuracy with values ranging from 0.96 to 0.99. Overall, the model performed well in the classification task, with high scores for multiple performance metrics across each soil type.
5.4. Support Vector Machine Model Results
The results of the SVM model used to classify different soil types are presented here. The confusion matrix computed for the SVM model is presented in Table 9. The confusion matrix shows that the model predicted almost all instances correctly, with a few misclassifications in each soil type. The overall accuracy of the model is very high (almost 100%), indicating that it is a high-performing model.
Table 10 shows the class distribution summary for the SVM model. The model’s sensitivity is high for each soil type, indicating that the model is good at correctly identifying the positive cases for each soil type. The model’s specificity is also high for all soil types, indicating that the model is good at correctly identifying the negative cases for all soil types. Moreover, the model’s balanced accuracy (the average of sensitivity and specificity) is remarkably high (almost 1) for all soil types. This shows that the model can accurately identify both positive and negative cases, making it a reliable classifier for the soil classification task.
The model exhibits high PPVs for all soil types, indicating its strong ability to predict the samples of specific soil types accurately. Similarly, the model shows high NPVs for all soil types, indicating its reliability in predicting the samples that do not belong to a particular soil class. Overall, the model performs exceptionally well on the dataset in terms of multiple performance metrics.
5.5. Comparison of ML Models’ Performance
To compare the efficiency of the ML models, different performance metrics such as overall accuracy, sensitivity, precision, and F1_score are utilized. The results of this evaluation are summarized in Table 11 and Table 12.
Table 11 shows that the SVM model achieved the highest overall accuracy of 99.84%. The ANN, RF, and DT models also performed well, achieving overall accuracies of 98.82%, 99.23%, and 95.67%, respectively. It is important to note that the datasets were imbalanced, and therefore, it is necessary to consider both the overall accuracy and other performance metrics for each soil type to accurately assess the ML models’ performance.
Table 12 presents the performance metrics of the ML models for each soil type. The table shows the sensitivity, precision, and F1_score values of each model and soil type. These metrics indicate the models’ efficiency in correctly identifying the soil type. Across all models, the sensitivity, precision, and F1_score values for each soil type are very high, indicating that the models successfully identified instances of all classes. However, the efficiency of the ANN model on minority class 7 was low compared to the other models. It scored lower sensitivity and F1_score values of 0.67 and 0.77, respectively, compared to the SVM and RF models with almost perfect scores for all metrics. This indicates that the ANN model needs additional data to better identify minority classes.
The SVM and RF models outperformed the ANN and DT models in terms of sensitivity, precision, and F1_score values for all soil types. These two models achieved almost perfect scores for all performance metrics for all soil types, indicating their high accuracy in the classification task. Overall, the performance of the ML models in classifying soils based on the CPT dataset is consistent with previous similar research carried out on ML techniques (e.g., see [12,21]).