1. Introduction
The manufacture of Ordinary Portland Cement (OPC), the cementitious material most widely used in concrete worldwide, is associated with high energy demand and large CO2 emissions [1,2,3]. OPC production is responsible for around 4 billion tons of carbon dioxide emissions each year, accounting for approximately 5–7% of global CO2 emissions [4,5]. Driven by growing environmental concerns and the implications of climate change, many strategies have been implemented to lessen the impacts of OPC production and use [6,7,8]. These strategies include the use of supplementary cementitious materials (SCMs) and the recovery of otherwise unutilized waste materials to reduce the dependency on OPC [9,10,11]. Nevertheless, the proportions of these SCMs that can be substituted for OPC are usually limited [12,13]. Fly ash is a good example: although it displays pozzolanic qualities throughout the different phases of OPC hydration, it contributes only marginally to early-age strength development [14,15,16]. Because the incorporation of fly ash can slow early hydration and extend the setting time [17,18], its use in large quantities is restricted. One of the most investigated routes to replacing OPC entirely is alkali activation, which produces cementitious binders that are less harmful to the environment [19,20]. Alkali-activated materials (AAMs) do not require the energy-intensive, high-temperature process used to produce OPC clinker [21,22]. AAMs, often referred to as geopolymers, are polymeric aluminosilicate cementing materials with three-dimensional network structures. They are activated with an alkaline agent, such as sodium hydroxide or sodium silicate, and are mostly constituted of industrial wastes such as fly ash [23,24]. Geopolymers are characterized by a distinctive chemistry that confers exceptional mechanical performance and durability. Because the main binder component is reused waste material, the approach is more environmentally friendly than OPC-based mixtures [25,26].
Predictive models for material strength are being developed to reduce unnecessary experimental repetition and the waste of ingredients. Best-fit curves generated by regression analysis are among the most popular models used to simulate the characteristics of concrete. However, because cementitious materials behave nonlinearly [27], regression approaches of this kind may not adequately capture the fundamental behavior of the material. In addition, regression methods can misestimate the importance of certain components [28]. Artificial intelligence modeling approaches, such as supervised machine learning (ML), are among the most advanced and well-established techniques used in contemporary research [29,30,31,32,33,34,35]. Fuzzy systems and fuzzy numbers have also found successful applications in the mining and civil engineering domains [36,37,38]. These approaches use input variables to model responses, and the resulting models are validated against experiments. Machine learning approaches have been used to predict the features of concrete and bituminous mixes [39,40,41,42]. While most previous machine-learning-based studies [43,44,45] concentrated on predicting the compressive strength (CoS) of conventional cementitious materials, only a few studies have focused on forecasting the characteristics of geopolymer mixtures.
The use of machine learning techniques in civil and concrete engineering has also received a lot of attention. Mustapha et al. [46] evaluated gradient-boosting ensemble models for predicting the compressive strength of quaternary-mix concrete in great detail. Their results show that CatBoost excels in predictive accuracy, achieving an R2 value of 0.9838 and showcasing notable enhancements compared with other gradient-boosting models. In a similar fashion, Alhakeem et al. [47] combined a hybrid Gradient Boosting Regression Tree (GBRT) model with GridSearchCV hyperparameter optimization to forecast the compressive strength of environmentally friendly concrete. Grid-search optimization significantly improved model performance, with the hybrid model achieving an R2 of 0.9612 and an RMSE of 2.3214. Additionally, Faraz et al. [48] studied the prediction of the compressive strength of metakaolin concrete using Gene Expression Programming (GEP) and Multigene Expression Programming (MEP). The best MEP model achieved an R2 value of 0.96, demonstrating that the MEP models outperformed the GEP models. This research also found that the water–binder ratio, superplasticizer percentage, and age are crucial factors affecting compressive strength. Shah et al. [49] applied MEP to simulate the mechanical characteristics of concrete made with E-waste aggregates, obtaining excellent precision with R-values above 0.9 for forecasting both compressive and tensile strength. Their sensitivity study showed the water–cement ratio and the E-waste aggregate percentages to be the most significant factors.
These studies show the potential of advanced machine learning models, such as ensemble and hybrid approaches, in predicting sustainable concrete properties accurately. They also emphasize the significance of optimizing parameters and conducting sensitivity analysis to improve model performance and comprehend the impacts of different input factors.
Dey et al. [50] saw promise in geopolymer concrete's ability to incorporate waste materials such as MT and recovered glass powder (GP) as environmentally friendly alternatives. Their research used response surface methodology to determine the material ratios that achieve the highest compressive strength. Thorough assessments were carried out on the fresh-state properties, mechanical traits, and long-term durability of the concrete mixtures. The researchers discovered that GP enhances workability whereas MT reduces it because of its higher fineness and larger surface area. Adding GP and MT together was reported to improve compressive strength by as much as 25%, while the use of GP alone slightly decreased the mechanical properties. Both materials had a limited impact on flexural and splitting tensile strengths compared with compressive strength. GP and MT mixtures surpassed standard benchmarks in durability tests, including rapid chloride permeability tests and 300 freeze–thaw cycles. The research found that combining GP and MT enhances durability and boosts mechanical properties, highlighting their potential as eco-friendly alternatives in concrete production. In a separate study, Martini et al. [51], to further the development of environmentally responsible building practices, investigated the mechanical properties of concrete mixes that included recycled concrete aggregate (RCA) from demolished structures in Abu Dhabi. Using varying percentages of recycled aggregate (0%, 20%, 40%, 60%, and 100%), they incorporated ground granulated blast furnace slag and fly ash as supplementary cementitious materials in seventy concrete mixtures. Through uniaxial compression and flexural tests, the researchers found that concrete containing 20% RCA reached a strength of more than 45 MPa, making it suitable for structural use.
Several efforts have been made to determine the compressive strength of geopolymer concrete (CoSGePC) under controlled conditions, also referred to as direct determination. Wakjira et al. [52] introduced a fresh approach for strength prediction and multi-objective optimization (MOO) of ultra-high-performance concrete (UHPC) that is both economical and friendlier to the environment, enabling smart, sustainable, and resilient construction methods. Their framework combines a range of tree and boosting ensemble machine learning models to create a precise and trustworthy prediction system for the uniaxial compressive strength of UHPC. The optimized models were merged into a super learner model, which acts both as a strong predictive tool and as one of the objectives in the multi-objective optimization problem. T.G. Wakjira and M.S. Alam [53] developed a predictive model using interpretable machine learning (ML) to address challenges in the performance-based seismic design (PBSD) of UHPC bridge columns. UHPC, valued for its exceptional strength, toughness, and durability, poses a substantial challenge in accurately measuring damage levels with suitable engineering demand parameters (EDPs). The authors' research attempts to close that gap by forecasting the drift ratio thresholds of UHPC bridge columns across four different damage states.
Table 1 lists various models from the literature used to forecast different properties of concrete. Among the many ML techniques employed to predict the CoSGePC are the Support Vector Machine (SVM), Gene Expression Programming (GEP), Artificial Neural Network (ANN), Decision Tree (DT), Random Forest (RF), Data Envelopment Analysis (DEA), Response Surface Methodology (RSM), Adaptive Neuro-Fuzzy Inference System (ANFIS), Micali–Vazirani Algorithm (MV), Retina Key Scheduling Algorithm (RKSA), Gradient Boosting (GB), Gaussian Process Regression (GPR), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine Regression (SVMR), Nonlinear Regression (NLR), Multi-linear Regression (MLR), Linear Regression (LR), Pure Quadratic (PQ), Interaction (IA), and Full Quadratic (FQ) methods, as highlighted in Table 1.
The significance of the research presented in this paper lies in its innovative approach to improving the prediction accuracy of the compressive strength (CoS) of geopolymer composites (GePCs). Geopolymers are environmentally friendly alternatives to Ordinary Portland Cement (OPC), whose production is responsible for significant CO2 emissions. The paper addresses the critical need for accurate predictive models to reduce experimental redundancy and resource waste in GePC development. Traditional regression models often fail to capture the nonlinear behavior of cementitious materials and may inaccurately estimate the importance of certain components. By contrast, this research leverages advanced machine learning (ML) techniques, specifically supervised machine learning (SML) models, to enhance prediction accuracy. The researchers used data from various scientific publications, paying particular attention to important input variables including fly ash, fine aggregate, ground granulated blast furnace slag (GGBS), sodium hydroxide (NaOH) molarity, and other similar factors. They compared two hybrid models, the Harris Hawks Optimization with Random Forest (HHO-RF) model and the Sine Cosine Algorithm with Random Forest (SCA-RF) model, to traditional models. The results show that the hybrid models, especially the SCA-RF model, significantly improve prediction accuracy, as evidenced by performance metrics such as the mean absolute error (MAE), root mean square error (RMSE), variance accounted for (VAF), and coefficient of determination (R2). This research provides valuable insights and methodologies for developing more efficient and accurate predictive models in the field of geopolymer composite materials, contributing to more sustainable construction practices.
2. Research Methodology
2.1. Harris Hawks Optimizer (HHO)
The Harris Hawks Optimizer (HHO) is a metaheuristic algorithm that mimics the predator–prey dynamics of the Harris hawk, comprising exploration, transition, and exploitation phases. The method needs little parameter tuning and can perform global searches, giving it strong search capability.
Harris hawks are cooperative hunters that use a combination of two strategies to locate their prey, selected at random with equal probability, as expressed in Equation (1):

X(t + 1) = Xrand(t) − r1 |Xrand(t) − 2 r2 X(t)|, if q ≥ 0.5
X(t + 1) = (Xrabbit(t) − Xm(t)) − r3 (LB + r4 (UB − LB)), if q < 0.5
(1)

Here, X(t + 1) is the position in the next iteration, X(t) is the position in the current iteration, and t is the iteration counter. The prey site is Xrabbit(t), i.e., the position with the highest fitness found so far, and Xrand(t) is a randomly selected member of the population. The numbers r1, r2, r3, r4, and q are randomly drawn from [0, 1], with q choosing at random which of the two strategies is used, and LB and UB are the lower and upper bounds of the search variables. The average position of the population, denoted Xm(t), is given by Equation (2):

Xm(t) = (1/N) Σ (k = 1..N) Xk(t)
(2)

where Xk(t) represents the k-th member of the group and N is the group size.
When the prey attempts to escape, HHO transitions between exploration and exploitation according to the prey's escaping energy, defined by Equation (3):

E = 2 E0 (1 − t/T)
(3)

Here, T is the maximum number of iterations, E0 is a random value in [−1, 1], and t is the current iteration. When |E| < 1, the algorithm enters the exploitation phase, whereas |E| ≥ 1 triggers the exploration phase (see Figure 1).
The exploitation phase selects among several attack strategies using r, a random number in [0, 1] (Figure 2).
As Equation (4) shows, the position is updated using the soft besiege strategy whenever 0.5 ≤ |E| < 1 and r ≥ 0.5:

X(t + 1) = ΔX(t) − E |J Xrabbit(t) − X(t)|, with ΔX(t) = Xrabbit(t) − X(t)
(4)

where J = 2(1 − r5) is the prey's random jump strength and r5 is a random number in [0, 1].
As described in Equations (5)–(7), when 0.5 ≤ |E| < 1 and r < 0.5, the position is adjusted using the soft besiege with progressive rapid dives:

Y = Xrabbit(t) − E |J Xrabbit(t) − X(t)|
(5)
Z = Y + S × Levy(Dim)
(6)
X(t + 1) = Y, if F(Y) < F(X(t)); otherwise Z, if F(Z) < F(X(t))
(7)

Here, Levy is the Lévy flight function, Dim is the problem dimension, S is a random vector of size Dim whose components are random values in the range [0, 1], and F(·) is the fitness function.
As indicated by Equations (8)–(10), when |E| < 0.5 and r < 0.5, the position is updated using the hard besiege with progressive rapid dives, which replaces X(t) in Equation (5) with the mean position Xm(t):

Y = Xrabbit(t) − E |J Xrabbit(t) − Xm(t)|
(8)
Z = Y + S × Levy(Dim)
(9)
X(t + 1) = Y, if F(Y) < F(X(t)); otherwise Z, if F(Z) < F(X(t))
(10)
The pseudo-code for HHO is provided in Algorithm 1. HHO can shift its behavior from exploration to exploitation depending on the escaping energy of the prey, which decreases drastically during the escape behavior.
Algorithm 1. Pseudo-code of HHO |
1 | Initialize the parameters popsize, MaxFes |
2 | Initialize a set of search agents (solutions) (X) |
3 | While(t ≤ MaxFes) |
4 | Evaluate each search hawk using the objective function; |
5 | Update Xrabbit (best location) and the best fitness |
6 | For i = 1 to popsize |
7 | Update the E by Equation (3); |
8 | Update the J; |
9 | If (│E│ ≥ 1). |
10 | Update the position of search agents using Equation (1) |
11 | End If. |
12 | If (│E│ < 1). |
13 | If (0.5 ≤ │E│ < 1 and r ≥ 0.5). |
14 | Update the position of search agents using Equation (4); |
15 | End If. |
16 | If (│E│ < 0.5 and r ≥ 0.5). |
17 | Update the position of search agents using Equation (5); |
18 | End If. |
19 | If (0.5 ≤ │E│ < 1 and r < 0.5). |
20 | Update the position of search agents using Equations (6)–(8); |
21 | End If. |
22 | If (│E│ < 0.5 and r < 0.5). |
23 | Update the position of search agents using Equations (9)–(11); |
24 | End If. |
25 | End If. |
26 | End For. |
27 | End While. |
28 | Return Xrabbit and best fitness. |
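To make the update logic above concrete, the following is a minimal Python sketch of the core HHO loop (exploration plus the plain soft and hard besiege moves; the rapid-dive variants of Equations (5)–(10) are omitted for brevity). The function and variable names, the sphere objective, and all parameter values are illustrative choices, not taken from the paper.

import numpy as np

def hho(objective, dim, bounds, pop_size=30, max_iter=200, seed=0):
    """Simplified HHO: exploration (Eq. (1)) plus plain soft/hard besiege moves."""
    rng = np.random.default_rng(seed)
    lb, ub = bounds
    X = rng.uniform(lb, ub, (pop_size, dim))              # hawk positions
    fits = np.apply_along_axis(objective, 1, X)
    best = fits.argmin()
    rabbit, rabbit_fit = X[best].copy(), fits[best]       # prey = best solution so far

    for t in range(max_iter):
        for i in range(pop_size):
            E0 = rng.uniform(-1, 1)                       # initial escape energy
            E = 2 * E0 * (1 - t / max_iter)               # Eq. (3): energy decays with t
            if abs(E) >= 1:                               # exploration phase, Eq. (1)
                if rng.random() >= 0.5:                   # q >= 0.5: perch near a random hawk
                    Xr = X[rng.integers(pop_size)]
                    X[i] = Xr - rng.random() * abs(Xr - 2 * rng.random() * X[i])
                else:                                     # q < 0.5: perch relative to the mean, Eq. (2)
                    Xm = X.mean(axis=0)
                    X[i] = (rabbit - Xm) - rng.random() * (lb + rng.random() * (ub - lb))
            else:                                         # exploitation (dive variants omitted)
                J = 2 * (1 - rng.random())                # random jump strength of the prey
                if abs(E) >= 0.5:                         # soft besiege, Eq. (4)
                    X[i] = (rabbit - X[i]) - E * abs(J * rabbit - X[i])
                else:                                     # hard besiege
                    X[i] = rabbit - E * abs(rabbit - X[i])
            X[i] = np.clip(X[i], lb, ub)
            f = objective(X[i])
            if f < rabbit_fit:                            # update the rabbit (best) position
                rabbit, rabbit_fit = X[i].copy(), f
    return rabbit, rabbit_fit

# Usage: minimize the 5-D sphere function.
best_x, best_f = hho(lambda x: float(np.sum(x ** 2)), dim=5, bounds=(-10.0, 10.0))
print(best_x, best_f)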
HHO mimics different hawk behaviors during the hunting phase but does not represent other cognitive behaviors. Many animals and other species in nature exhibit social behaviors and hierarchies. To reinforce the relationships among individuals in the HHO population, we therefore introduced a hierarchy. This allows individuals with high fitness to take the lead and guide the whole population in adjusting their positions. We first rank all individuals by fitness and designate A, B, and C as the three fittest individuals; each of them is updated individually.
Participant A's position update formula is expressed in Equation (11). By comparing the ratio of the algorithm's remaining runtime to its total runtime against a Cauchy random number, the current position is allowed to move towards the optimal position; as the run progresses, this replacement probability decreases, which helps the method avoid becoming trapped in local optima.
Here, j denotes a particular dimension. Xj(t + 1) is the value of the j-th dimension of individual A in the next iteration, and Xj,best(t) is the j-th dimension of the optimal position in the t-th iteration. Xj,m(t) and Xj,n(t) represent two individuals chosen at random from the population, with the condition that m and n differ from each other and from individual A. The variable rand is a random number in [0, 1], t is the current iteration count, and MaxFEs is the maximum number of function evaluations. The quantity G is determined by Equation (12).
The position update equation for individual B is given by Equation (13).
Here, Xj(t + 1) is the value of dimension j for individual B after the next iteration, and Xj(t) is the value of dimension j for individual B in the current cycle. The value k denotes a randomly selected dimension from those available. Xj,A(t) represents the value of dimension j for individual A's current iteration whereas Xj,B(t) represents the value of dimension j for individual B's current iteration.
Lastly, individual C's position update is given by Equation (14).
Here, Xj(t + 1) is the value of the j-th dimension of individual C in the next iteration. Q and R denote two individuals chosen at random from the population, and Xj,Q(t) and Xj,R(t) are their j-th dimension values in this iteration. Similarly, Xj,A(t), Xj,B(t), and Xj,C(t) are the j-th dimension values of individuals A, B, and C in the current iteration.
We have assigned these three exceptional individuals to oversee local refinement while the rest of the population carries out the original HHO update. Simultaneously, we direct individual A towards the most favorable position using a diminishing function: the replacement frequency decreases as the run progresses, allowing local refinement within the examined region. Individuals B and C are both linked to individual A, which enhances the communication among the best individuals. The update scheme of these three individuals is simpler than the original HHO update, so EHHO runs faster than HHO. Our enhancement aims to improve the convergence rate and precision, as well as the runtime efficiency and accuracy in feature selection tasks, despite the added algorithmic complexity.
2.2. Sine Cosine Algorithm (SCA)
In 2016, Mirjalili presented the SCA, a population-based optimization technique [85]. The basic principle of SCA is that, by utilizing Equations (15) and (16), every solution adjusts its position relative to the best solution found so far in the search space:

Xi(k + 1) = Xi(k) + r1 · sin(r2) · |r3 Pi(k) − Xi(k)|
(15)
Xi(k + 1) = Xi(k) + r1 · cos(r2) · |r3 Pi(k) − Xi(k)|
(16)
Here, Xi(k) denotes the solution's position in the i-th dimension at iteration k, Pi(k) is the i-th dimension of the best solution found so far, and r1, r2, and r3 are three random factors. To streamline the formulas, Equations (15) and (16) are merged for the final position update (see Figure 3) using a fourth random number r4, as shown in Equation (17):

Xi(k + 1) = Xi(k) + r1 · sin(r2) · |r3 Pi(k) − Xi(k)|, if r4 < 0.5
Xi(k + 1) = Xi(k) + r1 · cos(r2) · |r3 Pi(k) − Xi(k)|, if r4 ≥ 0.5
(17)
The goal of any metaheuristic technique is to properly balance exploration and exploitation. Equation (18) illustrates how SCA achieves this equilibrium by shrinking the range of the sine and cosine terms over the course of the optimization:

r1 = a − k (a/K)
(18)

Here, k and K are the current iteration number and the maximum number of iterations, respectively, and a is a constant. Figure 4 shows how the sine and cosine range decreases with the iterations for a = 3. The SCA method's pseudo-code is shown in Algorithm 2.
Algorithm 2. Pseudo-code of SCA. |
Random initialization of population of search agents (solutions) (X) |
Solution evaluation by the objective function |
P = the optimal solution found so far. |
while (k < K) do |
Update r1, r2, r3 and r4 |
for each search agent in the population do |
if (r4 < 0.5) then |
Update the position using Equation (15) |
else if (r4 ≥ 0.5) then |
Update the position using Equation (16) |
Estimate the value of objective function for each search agent. |
Update P |
k = k + 1. |
return P |
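As an illustration, here is a minimal Python sketch of Algorithm 2; the function and variable names and the parameter defaults (e.g., a = 3 as in Figure 4) are our own illustrative choices, not the paper's settings.

import numpy as np

def sca(objective, dim, bounds, pop_size=30, max_iter=200, a=3.0, seed=0):
    """Minimal Sine Cosine Algorithm following Equations (15)-(18) and Algorithm 2."""
    rng = np.random.default_rng(seed)
    lb, ub = bounds
    X = rng.uniform(lb, ub, (pop_size, dim))      # population of candidate solutions
    fits = np.apply_along_axis(objective, 1, X)
    best = fits.argmin()
    P, P_fit = X[best].copy(), fits[best]         # best solution found so far

    for k in range(max_iter):
        r1 = a - k * (a / max_iter)               # Eq. (18): shrink the sine/cosine range
        for i in range(pop_size):
            r2 = rng.uniform(0, 2 * np.pi, dim)   # random angle per dimension
            r3 = rng.uniform(0, 2, dim)           # random weight on the destination P
            if rng.random() < 0.5:                # r4 switches between the two moves
                X[i] += r1 * np.sin(r2) * np.abs(r3 * P - X[i])   # Eq. (15)
            else:
                X[i] += r1 * np.cos(r2) * np.abs(r3 * P - X[i])   # Eq. (16)
            X[i] = np.clip(X[i], lb, ub)
            f = objective(X[i])
            if f < P_fit:                         # keep the best solution P
                P, P_fit = X[i].copy(), f
    return P, P_fit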
2.3. Random Forest (RF) Algorithm
The random forest algorithm is an ensemble learning technique that works well with high-dimensional data. To reduce the risk of model overfitting and increase overall accuracy, it builds many decision trees during training and then merges their outputs [86]. To generate new training datasets, the random forest model uses random sampling with replacement; this reduces the impact of individual samples while increasing the model's diversity and resilience. The decision trees are constructed using a feature-random-selection approach: to choose candidate split attributes at each decision tree node, a subset of the whole feature set is randomly selected (see Figure 5). This random feature selection reduces feature correlation and enhances the model's capacity to generalize. Finally, the random forest model uses voting to decide on classifications. Each decision tree makes a forecast for a classification job, and the ultimate outcome is decided by a majority vote. Assuming the set of candidate classes is {c1, c2, …, cm} and hi(x) denotes the prediction of decision tree hi for input x, the voting result may be written as

H(x) = arg max over cj of Σ (i = 1..N) I(hi(x) = cj)

where I(·) is the indicator function and N is the number of trees.
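As a brief illustration of how such a forest is trained and queried in practice, the sketch below uses scikit-learn's RandomForestRegressor (the regression counterpart used for strength prediction, where tree outputs are averaged rather than voted on); the synthetic data and all parameter values are illustrative assumptions only.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: in the paper, X would hold the 9 mix-design inputs
# (fly ash, GGBS, NaOH molarity, ...) and y the measured CoSGePC values.
X, y = make_regression(n_samples=371, n_features=9, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(
    n_estimators=100,  # ntree: number of bootstrap-trained trees
    max_features=3,    # mtree: features considered at each node split
    random_state=42,
)
rf.fit(X_tr, y_tr)
# For regression, the forest averages the tree outputs instead of voting.
print("Test R^2:", rf.score(X_te, y_te))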
3. Data Presentation
Supervised machine learning (SML) methods require an assortment of input variables to achieve the desired predictive outcomes [87]. In this study, information on the compressive strength (CoS) values of geopolymer composites was derived from a number of scholarly articles [88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126] (see Supplementary Materials). A random selection of experimental data was made from the existing literature to guarantee objectivity. Unlike many studies that considered various properties of GePC, this research specifically collected data points related to CoS to facilitate the execution of the algorithms. The input variables for the algorithms included the fine aggregate, ground granulated blast furnace slag (GGBS), fly ash, sodium hydroxide (NaOH) molarity, NaOH quantity, water-to-solids ratio, sodium silicate (Na2SiO3), and gravel sizes of 10/20 mm and 4/10 mm, with CoS serving as the target output parameter. The performance of SML models is significantly influenced by the number and variety of input variables and datasets utilized. For this study, a total of 371 data points were compiled and used to run the machine learning algorithms, as detailed in the Supplementary Materials. These data points were selected based on mix proportions and the desired outcome, ensuring that each model had a consistent number of input parameters to generate the required outputs. Since the data were extracted from the existing literature, the experiments reflect variations in geographical location, testing setup, and sample geometry. However, these differences did not affect the primary conclusions of the study because the models considered only the input variables and their corresponding outcomes, independent of the specific testing conditions. The descriptive statistics for each input variable are presented in Table 2. The data underwent a normalization process, a standard preprocessing technique in which values are rescaled to a common range; this minimizes redundancy and scale-dependency issues and improves the consistency and integrity of the dataset. Descriptive statistics encompass a range of measures that provide concise summaries of data, whether representing an entire population or a subset. The mean, the median, and the mode are measures of central tendency, while the maximum, the minimum, and the standard deviation highlight the variability within the data.
Table 2 summarizes the descriptive statistics for each of the model's input variables.
Figure 6 illustrates the violin plot of each input factor relative to the compressive strength. The diagonal plots show the frequency distributions while the off-diagonal plots show the relationships between the input parameters and the output parameter. A positive or negative trend in the line for a parameter on the x-axis indicates a corresponding relationship with the parameter on the y-axis whereas a flat line indicates no correlation between the parameters.
Figure 7 further demonstrates the correlation patterns between the input parameters and the compressive strength values [127]. This graphical representation helps one visualize the nature and strength of these relationships, providing insight into how the different input variables influence the output parameter, compressive strength, in the context of geopolymer composites. A summary of the dataset for predicting the CoSGePC is given in Table 3.
Table 3 provides a summary of the dataset used in our study. The input variables include the fine aggregate (FA), ground granulated blast furnace slag (GGBS), sodium silicate (Na2SiO3), sodium hydroxide (NaOH), water-to-solids ratio (WS), and gravel sizes (4/10 mm, 10/20 mm), among others. The output variable is the compressive strength of geopolymer composites (CoSGePC). This dataset was sourced from the studies referenced above and serves as the basis for training our machine learning models.
4. Evaluations and Verifications of the Models
Constructing an intelligent model involves verifying and assessing the model's performance. For this reason, four assessment indices were used: the VAF, MAE, RMSE, and R2, all standard statistical measures of how accurately the predictions match the data. Explanations of these indices can be found in the published research [128].
By averaging the magnitude of the absolute error, the MAE directly represents the prediction error. A low MAE value indicates that the predicted CoSGePC closely matches the actual CoSGePC. The MAE is expressed as follows:

MAE = (1/n) Σ (i = 1..n) |yi − ŷi|
The root mean square error (RMSE) compares the actual and predicted CoSGePC by computing the standard deviation of the regression error. It is highly responsive to error, which magnifies the effect of large deviations on the final result. The RMSE represents the average deviation between the predicted and observed values [129]; it is calculated as the square root of the mean of the squared differences between them:

RMSE = √[(1/n) Σ (i = 1..n) (yi − ŷi)²]
The VAF describes the performance of the prediction by comparing the variance of the fitting error with the variance of the real CoSGePC. It cannot be used if all of the observed values are identical. Specifically, the VAF is defined as follows:

VAF = (1 − var(yi − ŷi) / var(yi)) × 100%
The coefficient of determination, often referred to as R2, is a statistical indicator used to assess the degree of linear relationship between dataset parameters. Its value lies between zero and one:

R2 = 1 − Σ (i = 1..n) (yi − ŷi)² / Σ (i = 1..n) (yi − ȳ)²
Here, n is the sample size, yi denotes the measured CoSGePC, ŷi the predicted CoSGePC, and ȳ the mean of the measured CoSGePC.
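The four indices can be computed directly from their definitions; the following is a minimal Python sketch (the function name and sample values are illustrative, not from the paper).

import numpy as np

def evaluate(y_true, y_pred):
    """Compute the four indices (MAE, RMSE, VAF, R2) from their definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))                        # mean absolute error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))               # root mean square error
    vaf = (1 - np.var(y_true - y_pred) / np.var(y_true)) * 100    # variance accounted for, %
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot                                      # coefficient of determination
    return {"MAE": mae, "RMSE": rmse, "VAF": vaf, "R2": r2}

# Illustrative values only (MPa): measured vs. predicted CoSGePC.
print(evaluate([30.0, 45.0, 52.5], [28.5, 47.0, 51.0]))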
4.1. Hybrid RF Model and Background
We now present the process of constructing and evaluating the proposed hybrid RF CoSGePC prediction model, which can be broken down into four steps:
In line with the Pareto principle, the database was first randomly split into two sets: a training set comprising eighty percent of the data and a testing set comprising the remaining twenty percent. This division was based on suggestions in the literature [130]. These steps were required to construct the prediction models and to assess the efficacy of existing models. The 4:1 training-to-testing ratio is widely used because it yields a high degree of prediction efficiency [131,132]. This ratio is discussed in more depth in the phases that follow.
To reduce the effect of input variables having varying scales in the database and to avoid unnecessary computational cost, all datasets were normalized to the range [0, 1] [133].
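A minimal sketch of these two preprocessing steps, assuming scikit-learn utilities and illustrative placeholder data (fitting the scaler on the training set only is our own precaution, not necessarily the paper's exact procedure):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder arrays: 371 samples x 9 mix-design inputs and the CoSGePC target.
rng = np.random.default_rng(0)
X, y = rng.random((371, 9)), rng.random(371)

# Step 1: random 80/20 (4:1) train/test split, as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: min-max normalization to [0, 1], fitted on the training set only
# to avoid leaking information from the test set.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)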
The total number of trees (ntree) and the number of features used to construct each tree (mtree) are the key hyperparameters of the RF algorithm. The best RF models were found by searching over these hyperparameters using HHO and SCA.
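As a sketch of how such a metaheuristic search could be wired to the RF hyperparameters, the example below reuses the sca() function and the preprocessed arrays from the earlier sketches and minimizes the cross-validated RMSE; the bounds, population size, and iteration budget are illustrative assumptions, not the paper's settings.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def rf_cv_rmse(params):
    """Objective for the optimizer: 5-fold CV RMSE of an RF with given (ntree, mtree)."""
    ntree = int(round(np.clip(params[0], 10, 500)))               # number of trees
    mtree = int(round(np.clip(params[1], 1, X_tr_s.shape[1])))    # features per split
    rf = RandomForestRegressor(n_estimators=ntree, max_features=mtree, random_state=42)
    scores = cross_val_score(rf, X_tr_s, y_tr, cv=5,
                             scoring="neg_root_mean_squared_error")
    return float(-scores.mean())                                  # lower RMSE is better

# Search the 2-D hyperparameter space with the sca() sketch from Section 2.2.
best_params, best_rmse = sca(rf_cv_rmse, dim=2, bounds=(1.0, 500.0),
                             pop_size=10, max_iter=20)
print("ntree:", int(round(best_params[0])),
      "mtree:", int(round(np.clip(best_params[1], 1, X_tr_s.shape[1]))))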
The final stage was to assess the accuracy of the predictions made by the constructed RF models against both the training set and the testing set. This was achieved with a Taylor diagram and four evaluation metrics: the mean absolute error (MAE), root mean square error (RMSE), variance accounted for (VAF), and R2.