1. Introduction
Data mining is an essential stage in extracting knowledge from data and is employed in various sectors, including industrial and medical applications [1]. Recently, the number of features gathered and retained in databases has grown, though not all of them are valuable for data analysis; some are utterly useless or unnecessary. Such features contribute nothing to information extraction and mainly increase the complexity and incompleteness of the outcomes. Feature selection therefore helps reduce the dimensionality of the data prior to processing [2]. For a vast database with n features to manage, the computational cost of assessing all possible feature subsets is exponential (O(2^n)), making exhaustive evaluation essentially unattainable. As a result, feature selection techniques serve as a foundation for data mining, allowing beneficial features to be retained for subsequent learning tasks while the irrelevant and less significant ones are discarded. In practice, feature selection approaches disregard unimportant features, allowing the learning process to be more effective [3]. It has also been demonstrated that feature selection improves the classification performance of data mining algorithms such as the KNN classifier.
The three main approaches to feature selection are filter methods, embedded methods, and wrapper methods [4]. In filter methods, features are filtered depending on general properties of the datasets (i.e., known metrics such as correlation), and these methods can be implemented without predictive models. Filter methods are fast, but they have difficulty avoiding overfitting and may fail to select the best features. In contrast, wrapper methods are built around predictive models and employ those models to select the best features; their main drawback is expensive computation, but they produce better performance. In embedded methods, the process of selecting features is embedded in the learning model; they are less computationally expensive than wrapper methods and can be considered better with respect to overfitting. A minimal wrapper-style evaluation is sketched below.
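As an illustration of the wrapper idea, the following sketch scores a candidate feature subset by training a KNN classifier on only those features and using its held-out accuracy as the subset quality. The dataset, the 5-nearest-neighbour setting, and the random candidate subset are illustrative assumptions, not part of any referenced method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def wrapper_score(X_train, X_test, y_train, y_test, mask):
    """Evaluate a binary feature mask using the classifier as a wrapper."""
    if mask.sum() == 0:          # an empty subset cannot be evaluated
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train[:, mask], y_train)
    return clf.score(X_test[:, mask], y_test)

# Illustrative data and a random candidate subset (roughly half of the features).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)
mask = rng.random(X.shape[1]) < 0.5
print(f"{mask.sum()} features selected, "
      f"accuracy = {wrapper_score(X_tr, X_te, y_tr, y_te, mask):.3f}")
```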
The method presented in this work is a wrapper approach. A large number of chosen features does not necessarily result in an excellent accuracy rate for many pattern classification problems. In some instances, classification speed and predictive accuracy degrade because features are unimportant, confusing, or strongly correlated with one another. During the learning stage, these features might have a detrimental influence on the classification process. Ideally, a feature selection approach decreases the cost of feature assessment while increasing classifier performance and quality. Several techniques have traditionally been used to choose features from training and testing data, including the Arithmetic Optimization Algorithm [5], Binary Butterfly Optimization [6], Aquila Optimizer [7], binary Gradient-based Optimizer [8], Firefly Algorithm [9], Atomic Orbit Search [10], Runge Kutta optimizer (RUN) [11], Colony Predation Algorithm (CPA) [12], Slime Mould Algorithm (SMA) [13], Harris Hawks Optimization (HHO) [14], Hunger Games Search (HGS) [15], and others.
A feature selection approach based on ant colony optimization is given in [16]. The approach uses numerous rounds to select the best feature subset without utilizing any learning techniques. Furthermore, feature importance is estimated using the correlation among features, thereby reducing redundancy. The experimental findings on numerous commonly used datasets demonstrate the method's efficiency and its enhancements over earlier comparable approaches. The work in [17] provides a new machine learning technique for high-dimensional data, which uses the Henry gas solubility optimization (HGSO) method to pick key features and enhance classification performance. The suggested technique is assessed against well-established optimization algorithms using multiple datasets with a broad range of feature sizes, from tiny to large. The empirical research indicates that the suggested method is highly successful on both low- and high-dimensional data.
A new hybrid ant colony optimization approach for feature selection utilizing a learning algorithm is presented in [18]. Choosing a subset of salient features of reduced size is an essential part of this approach. The suggested method employs a hybrid search methodology that combines the benefits of the filter and wrapper techniques. The comparison details demonstrate that the presented process has a remarkable capacity to construct reduced subsets of prominent features while still giving high classification performance.
In [19], a new feature selection technique based on a mathematical model of grasshopper interaction while searching for food is suggested. The grasshopper optimization technique was modified to make it suitable for the feature selection problem. The proposed strategy is augmented by statistical measures that remove redundant features and retain the most informative ones during iterations. Comparative trials show that the suggested approach is more effective than existing classification techniques. The work in [20] attempted to improve the effectiveness of text classification using the particle swarm optimizer. Many exploratory search strategies are conducted in that study by examining recent advances in enhanced particle swarm optimizers and the characteristics of traditional feature selection methods. A basic model is chosen first, followed by two upgraded models based on structural inertia weight and a steady restriction factor to optimize the feature selection approach. The trial findings and significance tests reveal that the dynamically upgraded model outperforms all others in text classification effectiveness and dimension reduction.
The work in [21] suggests a non-negative feature selection method with variable graph constraints to overcome the feature selection problem. In the presented model, linear regression is used to project the original data into a low-dimensional space to create the label matrix. The results demonstrate the efficiency of the suggested strategy on ten real datasets compared with other approaches. A novel binary version of the grasshopper optimizer for the feature subset selection problem is presented in [22]. The suggested binary grasshopper optimizer is evaluated and compared against five optimization algorithms employed for the feature selection problem. These techniques were implemented and tested on data sets of varying sizes. The findings showed that the suggested strategy outperformed the other approaches examined.
The paper [23] provides a novel feature-selection search strategy for feature selection-based intrusion detection systems (IDS) using the cuttlefish optimization algorithm. Because IDS deal with a vast quantity of data, one of their most important duties is maintaining the highest quality of features that reflect the real data set while removing duplicate and unnecessary characteristics. Compared to the results produced utilizing all features, the feature subset derived via the proposed method provides a greater increase in the security and correctness rate with a reduced probability of detection errors.
Unfortunately, several issues are not addressed in the research mentioned above [24]. To begin, all features are chosen at random with the same probability. As a result, the principal features cannot easily be taken for inclusion in the newly generated feature subset. Moreover, traditional feature selection techniques cannot adequately select the most informative features, and the improved optimization methods still struggle to determine the most relevant features through an efficient search process. These shortcomings significantly reduce the efficiency of searching for the ideal feature subset.
Motivation and Contribution
To some extent, population-based optimization algorithms can prevent stagnation in local optima, and they have a high capacity to converge toward the optimum. One of the primary motivations for this research is that there is no single optimizer suitable for addressing all kinds of problems, as stated by the No-Free-Lunch theorem; hence, the excellent performance of an optimization method on one set of problems does not guarantee compelling performance on another set. As a further motivation for this contribution, no one has yet combined the Aquila Optimizer with the leading search operators of the Whale Optimization Algorithm to tackle feature selection in a systematic manner. The authors selected the Whale Optimization Algorithm because of its proven efficiency and superiority compared with numerous algorithms, such as PSO, GA, and GWO, in several optimization problems in different fields. Moreover, the logarithmic spiral function of the WOA is an attractive operator for enhancing the AO phases to cover a larger area of an uncertain search space. This was the primary motivator for selecting the Aquila Optimizer as the core of our work. The overarching focus of this paper is on providing a new binary variant of the Aquila Optimizer, called IAO, for wrapper feature selection. The IAO enhances the original search strategies of the Aquila Optimizer by using the main operators of the Whale Optimization Algorithm. This modification enables the IAO to tackle the main weaknesses of using a single search method by avoiding the local search problem and the loss of solution diversity in the search stage. The suggested technique identifies the best feature subset, which reduces the feature subset size while increasing classification performance. The proposed IAO is evaluated on benchmark problems in terms of fitness values, the number of selected features, and classification accuracy. The obtained results show that the IAO achieves promising outcomes compared with different feature selection methods. Moreover, the search ability of the IAO is clearly observed in determining the best relative subset of features.
The rest of the paper is organized as follows:
Section 2 describes the Aquila Optimizer and the Whale Optimization Algorithm.
Section 3 introduces the proposed algorithm for feature selection.
Section 4 presents the experiments, results, and discussion.
Section 5 presents the conclusion and future work.
3. Proposed Method
This section provides a description of the IAO method. It utilizes the benefits of the WOA to enhance the performance of the original version of the AO. In detail, the WOA is used as a local search within the AO to raise its capability in solving different optimization problems, which adds ability and flexibility to the IAO to explore and exploit the search space as well as to improve its diversity.
The structure of the IAO is illustrated in Figure 1. The IAO starts by declaring the global parameters and generating the initial population using a random distribution. This population is evaluated to determine the best solution using the objective function. Throughout the optimization process of the IAO, the expanded exploitation of the original AO is improved using the spiral behavior of the WOA to update the solutions. In this regard, the expanded exploitation equation of the AO, namely Equation (8), as in steps 30 to 36 of Algorithm 1, is replaced by the spiral equation of the WOA, Equation (17). Therefore, the exploitation phase of the IAO benefits from both the AO and the WOA. Then, each solution is checked and evaluated by the objective function, and the best one is retained for the subsequent iteration. This sequence is repeated for all solutions until the stopping condition is reached; afterwards, the best result within the population is selected and saved. Finally, the final results are presented.
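To make the hybrid update step concrete, the sketch below applies the standard WOA logarithmic-spiral move around the best solution, X(t+1) = |X_best − X| · e^(b·l) · cos(2πl) + X_best, to a whole population. The parameter b = 1 and the uniform draw of l in [−1, 1] follow the usual WOA description; the array shapes and variable names are assumptions for illustration, not the authors' exact implementation of Equation (17).

```python
import numpy as np

def woa_spiral_update(X, X_best, b=1.0, rng=None):
    """Spiral (bubble-net) move of the WOA applied to every solution.

    X      : (N, D) current positions
    X_best : (D,)   best solution found so far
    """
    rng = np.random.default_rng() if rng is None else rng
    N, _ = X.shape
    l = rng.uniform(-1.0, 1.0, size=(N, 1))   # random number in [-1, 1] per solution
    dist = np.abs(X_best - X)                 # distance to the best solution
    return dist * np.exp(b * l) * np.cos(2.0 * np.pi * l) + X_best

# Toy usage: 5 solutions in a 4-dimensional search space.
rng = np.random.default_rng(42)
X = rng.random((5, 4))
X_new = woa_spiral_update(X, X[0], rng=rng)
print(X_new.shape)  # (5, 4)
```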
Furthermore, the IAO begins by declaring the parameters of both the AO and the WOA. Then the AO generates a random binary population X of size N × D (N solutions, each of dimension D). The population values are converted to binary values by Equation (22).
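The snippet below shows one common way of converting continuous position values into a binary feature mask using a sigmoid transfer function with a random threshold; the paper's own conversion rule may differ, so this is only an assumed illustration.

```python
import numpy as np

def to_binary(X, rng=None):
    """Map continuous positions to {0, 1}: 1 means the feature is selected."""
    rng = np.random.default_rng() if rng is None else rng
    prob = 1.0 / (1.0 + np.exp(-X))            # sigmoid transfer function
    return (prob > rng.random(X.shape)).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                    # 3 solutions, 8 candidate features
print(to_binary(X, rng=rng))
```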
Then, the initial objective function value is computed using the operators of the AO, whereas the remaining values of the objective function are computed using the IAO structure. This sequence is iterated until the stopping condition is met, and in the final step the best results are presented as the output of the IAO. The formula in Equation (22) is applied to compute the objective function value:

Fit = λ × γ + (1 − λ) × (|R| / C)

where γ denotes the error of the fitness function (Equation (23)) and λ balances the error against the number of selected features. The terms |R| and C denote the number of selected features and the total number of features, respectively [5].
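A minimal sketch of the fitness computation implied by the definitions above follows; the weight value λ = 0.99 and the error argument are placeholders rather than the paper's exact settings.

```python
def fitness(error, mask, lam=0.99):
    """Fitness = lam * error + (1 - lam) * (|R| / C).

    error : error of the model evaluated on the selected features
    mask  : binary vector, 1 = feature selected
    lam   : trade-off between error and subset size (placeholder value)
    """
    R = sum(mask)            # number of selected features
    C = len(mask)            # total number of features
    return lam * error + (1.0 - lam) * (R / C)

print(fitness(error=0.12, mask=[1, 0, 1, 1, 0, 0, 0, 1]))  # 0.1238
```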
These terms follow the Cox proportional hazards model, h(t | x) = h_0(t) exp(βx), where x denotes the predictor variable, h_0(t) refers to the baseline hazard rate function, and h(t | x) refers to the hazard rate at time t for x.
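The small example below evaluates the Cox proportional hazards relationship h(t | x) = h_0(t) exp(βx) for a single predictor; the baseline hazard and the coefficient value are made up purely to illustrate how the terms defined above interact.

```python
import numpy as np

def cox_hazard(t, x, beta, baseline):
    """Hazard rate at time t for predictor value x under the Cox model."""
    return baseline(t) * np.exp(beta * x)

# Illustrative choices: a constant baseline hazard and a single coefficient.
baseline = lambda t: 0.05          # h_0(t): assumed constant over time
beta = 0.8                         # assumed regression coefficient
for x in (0.0, 1.0):
    print(f"x={x}: h(t=12) = {cox_hazard(12.0, x, beta, baseline):.4f}")
```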
Moreover, the complexity of the presented algorithm depends on the complexities of the AO and the WOA, each of which is O(N × (T × D)) for N solutions, T iterations, and dimension D. Hence, the overall complexity of the IAO is O(N × (T × D)).
4. Experiment and Results
The use of regression modeling to investigate the effects of many factors on a response is commonplace. For time-to-event data, the Cox proportional hazards model is widely applied; it is used in medical studies to examine the association between a patient's survival time and other predictor variables.
Creating a Cox proportional hazards model that includes all of the predictors is undesirable when the number of predictors is enormous, since it gives low prediction accuracy and is challenging to interpret [33]. For these reasons, variable selection has become a significant emphasis in Cox proportional hazards modeling.
Four biological benchmark datasets are used in this study to evaluate the performance of the modified variant of AO (IAO). Diffuse large B-cell lymphoma (DLBC2002) [34] comprises samples from 240 lymphoma patients, each of which has 7399 gene expression measurements. Lung cancer (Lung-cancer) [35] is the second dataset; it comprises information on 86 lung cancer patients, each with 7129 gene expression measurements. The Dutch Breast Cancer dataset (Dutch Breast) is the third dataset [36]; it contains information on 295 breast cancer patients, with 4919 gene expression measurements per patient. The cytogenetically normal acute myeloid leukaemia dataset (AML-full) is the fourth collection [37]; it contains information on 165 patients, with 6283 gene expression measurements per patient. The survival time, whether censored or not, is the response variable in all datasets.
The improved version of AO has been compared with a set of popular optimizers, including the standard AO algorithm, the firefly algorithm (Firefly), the genetic algorithm (GA), the salp swarm algorithm (SSA), the particle swarm optimizer (PSO), differential evolution (DE), the WOA, and the moth-flame optimizer (MFO). The parameters of these algorithms are listed in Table 1. For an unbiased comparison, all of the optimizers mentioned above are implemented over 30 runs with 100 iterations and 50 search agents on the MATLAB 2020a platform. Several statistical metrics have been computed to provide a detailed analysis. The metrics reported in Table 2, Table 3, Table 4, Table 5 and Table 6 are the average, worst (Max, Equation (25)), and best (Min, Equation (24)) fitness function (F) values (Equation (23)), along with the standard deviation (Equation (26)) and the number of selected features:
where F refers to the fitness function value and N refers to the number of samples.
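As a sketch of how these statistics could be collected in practice, the snippet below computes the average, best (Min), worst (Max), and standard deviation of the fitness values obtained over independent runs; the 30 random values are stand-ins for actual optimizer outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
fitness_per_run = rng.normal(loc=5.0, scale=0.3, size=30)  # stand-in for 30 runs

stats = {
    "Average": fitness_per_run.mean(),
    "Min (best)": fitness_per_run.min(),
    "Max (worst)": fitness_per_run.max(),
    "Std": fitness_per_run.std(ddof=1),   # sample standard deviation
}
for name, value in stats.items():
    print(f"{name:12s} {value:10.4f}")
```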
The average, Max, and Min values of the log-likelihood fitness function, together with the standard deviation and the number of selected features, are reported in Table 2, Table 3, Table 4, Table 5 and Table 6, respectively, to highlight the performance that our IAO and the other employed algorithms can achieve on the four datasets (in all tables the best values are in boldface). According to Table 2, Table 3 and Table 4, the suggested algorithm, IAO, outperformed the other algorithms on all datasets, as it attains the lowest fitness function values. The reported data in Table 5 affirm the highly consistent performance of the proposed optimizer compared with the AO, Firefly, SSA, GA, PSO, DE, and MFO. The WOA can be placed in the second rank after the proposed IAO in handling the first and fourth datasets (DLBC2002 and AML-full). In contrast, the WOA is not efficient for the Lung-cancer dataset. Moreover, the WOA shows a remarkable deviation from the IAO on the Dutch Breast dataset. Accordingly, the IAO can be considered a successful technique across all datasets. Regarding the number of selected features in Table 6, the IAO chose fewer genes than the other algorithms. The MFO is the worst technique for handling these datasets, as it picked the highest number of genes.
To measure the accumulated performance of the IAO, the mean values of the average, Min, Max, standard deviation, and number of selected features over the four datasets are depicted in Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6. The displayed figures are clear evidence of the efficiency and superiority of the IAO and its success in handling the four datasets, as it has the smallest average, Min, Max, standard deviation, and number of selected features while maintaining high performance. These observations are primarily due to the created algorithm's ability to compensate for the limitations of the typical AO algorithm. In addition, Figure 7 illustrates the average computation time over all datasets. From this figure, the IAO shows an acceptable time compared with the other methods.
For further analysis, the Friedman test is applied to check the statistical significance of the compared methods. It is one of the most important statistical tests for indicating significant differences between compared algorithms [38,39].
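A brief sketch of how such a Friedman test could be run in practice with SciPy follows; the three arrays of per-dataset scores are invented placeholders for compared algorithms.

```python
from scipy.stats import friedmanchisquare

# Placeholder scores of three algorithms on the same four datasets.
alg_a = [0.91, 0.88, 0.95, 0.90]
alg_b = [0.89, 0.85, 0.93, 0.88]
alg_c = [0.86, 0.84, 0.90, 0.87]

stat, p_value = friedmanchisquare(alg_a, alg_b, alg_c)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
```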
Table 7 ranks all methods on all datasets using the Friedman test. From Table 7, we can conclude that the IAO was ranked first on the DLBC2002, Lung-cancer, and AML-full datasets, whereas it was ranked second on the Dutch Breast dataset after the Firefly method. Moreover, the IAO shows good performance in Figure 8, Figure 9, Figure 10 and Figure 11, which illustrate the boxplots for all datasets.
To analyse the exploration and exploitation of the IAO and the original version of the AO, their behaviours on the studied datasets are illustrated in Figure 12. The curves of Figure 12 illustrate the ratios of exploitation and exploration throughout the search process on the studied datasets for the IAO and the standard AO. From these curves, it can be observed that there is a balance between both curves of the IAO throughout the search process. The exploration ratio rises in the first part of the optimization process; exploitation starts after about 10% of the process and works together with exploration at a nearly equal ratio, as indicated in the cases of DLBC2002, Dutch Breast, and AML-full, while the AO is still searching for adequate solutions. Hence, the IAO achieves a successful trade-off between the two phases.
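One common, diversity-based way of estimating such exploration and exploitation percentages is sketched below, assuming the dimension-wise distance-to-median measure; whether this exact measure underlies Figure 12 is an assumption.

```python
import numpy as np

def xpl_xpt_ratios(populations):
    """Exploration/exploitation percentages per iteration from population diversity.

    populations : list of (N, D) arrays, one population snapshot per iteration
    """
    div = []
    for P in populations:
        # mean absolute distance of individuals from the dimension-wise median
        div.append(np.mean(np.abs(np.median(P, axis=0) - P)))
    div = np.array(div)
    div_max = div.max()
    xpl = 100.0 * div / div_max                    # exploration %
    xpt = 100.0 * np.abs(div - div_max) / div_max  # exploitation %
    return xpl, xpt

# Toy usage: a population whose spread shrinks over 5 iterations.
rng = np.random.default_rng(0)
pops = [rng.normal(scale=1.0 / (t + 1), size=(20, 6)) for t in range(5)]
xpl, xpt = xpl_xpt_ratios(pops)
print(np.round(xpl, 1), np.round(xpt, 1))
```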