1. Introduction
With the growing importance of machine learning and big data analysis, feature selection has become one of the most relevant techniques in the field. It now extends into many disciplines, including medical applications, cybersecurity, DNA microarray data, and many other areas [1,2,3]. Machine learning models can benefit significantly from the accurate selection of feature subsets, which speeds up learning and improves generalization. Feature selection can considerably simplify a dataset, so that models trained on it are faster to fit and less prone to overfitting. A Feature Selection Algorithm (FSA) can be described as a computational procedure that produces a reduced subset of features whose prediction accuracy is comparable to that of the full feature set. In its general form, an FSA algorithmically searches through the set of features until a “best” subset is found [4].
The existence of irrelevant and/or redundant features motivates the need for a feature selection process. An irrelevant feature is defined as a feature that does not contribute to the prediction of the target variable. On the other hand, a redundant feature is defined as a feature that is correlated with another relevant feature, meaning that it can contribute to the prediction of a target variable whilst not improving the discriminatory ability of the general set of features. FSAs are generally designed for the purpose of removing irrelevant and redundant features from the selected feature subset.
In real-life datasets, the true relevance of each feature to the target variable is unknown; hence, verifying that a subset of features is optimal is nearly impossible. The most common way to evaluate FSAs in such scenarios is to train a learning algorithm on the selected feature subsets and measure the resulting prediction accuracy [5]. However, this approach has a drawback: the outcome is sensitive to the choice of learning algorithm as well as to the feature subset(s) [5].
Consequently, controlled data environments for evaluating FSAs have become necessary for the development of novel and robust FSAs. One way of standardizing such environments is through synthetic datasets. Because the performance of an FSA depends on the degree of relevance and irrelevance within the dataset, an artificially controlled environment in which the relevance of every feature is known offers a significant advantage for performance evaluation. Results are also more conclusive, since the optimal solutions are known and do not rely on external evaluations. Moreover, researchers can easily determine which algorithms are more accurate based on the number of relevant features selected [6]. In addition, synthetic datasets provide a standardized platform on which FSAs with different underlying architectures can be compared in a model-agnostic manner. The existing literature lacks a systematic evaluation of FSAs on common benchmark datasets under controlled experimental conditions.
This paper presents a set of 10 synthetic datasets generated so that the relevant, redundant, and irrelevant features are known, providing a standardized and unbiased evaluation of an FSA’s performance. The proposed datasets are organized in ascending order of complexity so that the level of performance of any algorithm can be placed on a common scale. The 10 datasets draw inspiration from various natural and algorithmic sources, including mathematical models, computational logic, and geometric shapes. These standardized datasets were chosen for their generalizability compared with the more chaotic real-life datasets. Within this framework, we advocate the use of synthetic data over real-world data for benchmarking: if an FSA cannot perform desirably in a controlled synthetic environment, it is unlikely to perform adequately under real-world conditions. Furthermore, we evaluate the performance of eight popular FSAs to demonstrate the benchmarking capacity of the proposed synthetic datasets. These algorithms were chosen based on their relevance in industry and their novelty in the field.
The datasets were generated using Python. The code to access and modify the proposed synthetic datasets is available in our GitHub repository (https://github.com/ro1406/SynthSelect, accessed on 24 January 2024). Researchers are encouraged to download the datasets and manipulate the relevance of features as required for testing their own FSAs. The primary goal of this paper is to facilitate the development and evaluation of novel feature selection algorithms and the further study of existing ones. The contributions of this paper are summarized below:
Introduce a universal set of synthetic datasets for the evaluation of different types of feature selection algorithms within a controlled, unbiased environment with known optimal solutions.
Conduct a comprehensive evaluation of popular feature selection algorithms on the aforementioned synthetic datasets to benchmark their performance and indicate how they may perform on real-world datasets.
Provide public access on GitHub to the proposed synthetic datasets to facilitate common benchmarking of both novel and existing FSAs.
This paper is organized as follows: in Section 2, we present the related works and a literature review. In Section 3, we describe the generation of the ten synthetic datasets and discuss their characteristics and inspirations. In Section 4, we describe the methodology used for the evaluation of the performance of eight selected FSAs. In Section 5, we present and analyze the results of Section 4. In Section 6, we provide a synopsis for using our generated synthetic datasets to test and evaluate an FSA of interest. Finally, conclusions and some insights into future work are presented in Section 7.
2. Literature Review
Synthetic datasets were popularized by Friedman et al. in the early 1990s, when continuous-valued features were generated for regression modeling of high-dimensional data [7,8]. Friedman’s 1991 paper continues to be widely cited in the feature selection literature because it addresses the complex feature selection problem through synthetically generated data. In 2020, synthetically generated adaptive regression splines were used to develop a feature selection solution for Engineering Process Control (EPC) [9]. Although that work was specific to recurrent neural networks, it illustrates the importance of synthetic data in developing reliable feature selection techniques.
In [10], Yamada et al. highlighted the relevance of synthetic data for the development of novel feature selection techniques. The authors discussed the challenge of feature selection for nonlinear functions and proposed a solution using stochastic gates. This approach outperforms prior regression-based methods (such as the LASSO for variable selection) and generalizes better to nonlinear models. Applications to nonlinear models, including neural networks, were discussed, in which the proposed approach achieved higher levels of sparsity. The stochastic gate algorithm was subsequently tested on both real-life and synthetically generated data to further validate its performance. Synthetic datasets are generally used to validate feature selection algorithms, as was similarly done in [11] for a feature selection framework for datasets with missing data. Ref. [12] explained that the lack of available real data is a challenge when considering unsupervised learning on waveform data and suggested using synthetically generated datasets to approximate real data applications.
Other applications of synthetic data to unsupervised feature selection have proven effective in the literature, as in [13], where the authors presented two novel unsupervised FSAs experimentally tested on synthetic data and recommended studying the impact of noisy features as further work. Synthetic data have also been used for evaluating dynamic feature selection algorithms [14], i.e., the process of dynamically adjusting the feature subsets based on the learning algorithm used [15]. Unsupervised feature selection has been growing in relevance, as it removes the need for class labels when producing feature subsets. Synthetic datasets have also been used for comparative studies of causality-based feature selection algorithms [16].
Most recently, synthetic datasets were presented as a valuable benchmarking technique for the evaluation of feature selection algorithms [17]. That paper presented six discrete synthetically generated datasets that drew inspiration from digital logic circuits. In particular, the generated datasets include an OR-AND circuit, an AND-OR circuit, an Adder, a 16-segment LED display, a comparator, and finally, a parallel resistor circuit (PRC). These datasets were then used to test several popular feature selection algorithms. Similar work with discrete-valued synthetic datasets was presented in [18], where the authors produced a Boolean dataset based on the XOR function. The CorrAL dataset was proposed in that paper, containing six Boolean features A0, A1, B0, B1, I, and C, with the target variable being determined by the Boolean function Y = (A0 ∧ A1) ∨ (B0 ∧ B1). Features A0, A1, B0, and B1 were the relevant features, I was irrelevant, and finally, C was redundant (correlated with the target variable). CorrAL was later extended to 100 features, allowing researchers to consider higher-dimensional data than the original synthetically generated dataset [19].
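To make this construction concrete, the following is a minimal sketch of how a CorrAL-style dataset could be generated. The feature names follow the description above; the 25% flip rate used for the correlated feature C is an assumption chosen for illustration, not necessarily the exact correlation level of the original dataset.

```python
import numpy as np
import pandas as pd

def make_corral_style(n_samples=128, flip_rate=0.25, seed=0):
    """Generate a CorrAL-style Boolean dataset.

    Relevant features: A0, A1, B0, B1; I is irrelevant (random);
    C is redundant (it mirrors the target, flipped for a fraction of rows).
    """
    rng = np.random.default_rng(seed)
    A0, A1, B0, B1, I = rng.integers(0, 2, size=(5, n_samples))
    y = (A0 & A1) | (B0 & B1)                 # target: (A0 AND A1) OR (B0 AND B1)
    flip = rng.random(n_samples) < flip_rate  # flip C for a fraction of the samples
    C = np.where(flip, 1 - y, y)
    return pd.DataFrame({"A0": A0, "A1": A1, "B0": B0, "B1": B1,
                         "I": I, "C": C, "target": y})

df = make_corral_style()
print(df.head())
```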
In [20], the authors developed synthetic data that mimic microarray data. This built on an earlier study of hybrid evolutionary approaches to feature selection, namely, memetic algorithms that combine wrapper and filter feature evaluation metrics [21]. Initially, the authors had presented a feature ranking method based on a memetic framework that improved on the efficiency and accuracy of non-memetic frameworks [22].
Another well-known synthetic dataset is the LED dataset, developed in 1984 by Breiman et al. [23]. This is a classification problem with 10 possible classes, described by seven binary attributes (0 indicating that an LED segment is off and 1 indicating that it is on). Two versions of this dataset were presented in the literature, one with 17 irrelevant features and another with 92 irrelevant features, both containing 50 samples. Different levels of noise (2, 6, 10, 15, and 20%) were also incorporated into the dataset, allowing the evaluation of FSAs’ tolerance to noise. These synthetic datasets were then used to test different feature selection algorithms, as indicated in [24]. A similar discrete synthetic dataset is the Madelon dataset [25], in which the relevant features lie on the vertices of a five-dimensional hypercube. The authors included 5 redundant features and 480 irrelevant features randomly generated from a Gaussian distribution. In [1], the authors tested ensemble feature selection for microarray data by creating five synthetic datasets. It was demonstrated empirically that the feature selection algorithms tested were able to find the (labeled) relevant features, which helped in evaluating the stability of the proposed feature selection methods.
Synthetic datasets with continuous variables have also been presented in the literature. In [26], the authors presented a framework for global redundancy minimization and subsequently tested it on synthetically generated data. The dataset contained a total of 400 samples across 100 features, with the features divided into 10 groups of highly correlated values drawn randomly from a Gaussian distribution. This dataset, along with other existing datasets, was used as the testing framework for the proposed algorithms. Synthetic data have also been used in applications such as medical imaging, where Generative Adversarial Networks (GANs) are employed to produce image-based synthetic data [27]. However, the limitations of synthetic data must also be noted, as they often pose restrictions with respect to the various challenges encountered in the feature selection process [17,28,29].
More specifically, it is important to acknowledge that synthetic data often lack “realism”, meaning that the generated data are not as chaotic as what could be expected in the real world. Many real-world applications involve outliers and randomness that cannot be accurately modeled with synthetic data [29]. Furthermore, synthetic datasets are often generated precisely because real-world data are scarce, which poses an obstacle in itself: the limited nature of the available data restricts researchers from modeling (and thus synthesizing) the data accurately. This can lead to synthetically generated data that are less nuanced than their real-world counterparts, particularly for high-dimensional, information-dense applications such as financial data [30,31].
Feature selection methods are categorized into three distinct types: filter methods, wrapper methods, and embedded methods. Filter methods are considered a preprocessing step that determines the best subset of features without employing any learning algorithm [32]. Although filter methods are computationally less expensive than wrapper methods, they have the drawback of not using a learning algorithm trained on the data to guide the selection [33]. In wrapper methods, a subset is first generated, and a learning algorithm is applied to the selected subset so that performance metrics for this specific subset are recorded; candidate subsets are explored algorithmically until an optimal solution is found. Embedded methods, on the other hand, combine the qualities of both filter and wrapper methods [34]. Embedded feature selection techniques have risen in popularity due to their improved accuracy and performance: they combine filters and classifiers, drawing on the advantages of different feature selection methods to produce an optimal selection on a given dataset.
Despite a wide variety of algorithms for feature selection, there is no agreed “best” algorithm, as FSAs are generally purpose-built for a specific application [24]. By this, we mean that different FSAs work well on different data, depending on their size, type, and general application. In this paper, we introduce a collection of synthetic datasets that can be used to test the performance of different FSAs in a more standardized evaluation process.
4. Methodology
In this study, we selected an array of FSAs and tested their performance on our synthetically generated datasets. Our tests involved running each algorithm on each dataset and asking it to return, first, a number of features equal to the number of originally generated relevant features and, second, twice that number. Each time, we used the number of correct features selected as the performance metric. We defined a correct feature as either a relevant feature or a redundant copy of a relevant feature that had not been covered yet; if an algorithm selected two redundant features generated from the same relevant feature, this counted as only one correct feature.
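As an illustration of this scoring rule, the following is a minimal sketch (not our exact implementation) that counts correct features given the set of relevant features and a mapping from each redundant feature to the relevant feature it was generated from; the names used are hypothetical.

```python
def count_correct_features(selected, relevant, redundant_of):
    """Count correct features: each relevant feature is credited at most once,
    whether it is selected directly or via one of its redundant copies.

    selected     -- iterable of selected feature names
    relevant     -- set of relevant feature names
    redundant_of -- dict mapping each redundant feature to its source relevant feature
    """
    covered = set()
    for f in selected:
        if f in relevant:
            covered.add(f)                # relevant feature selected directly
        elif f in redundant_of:
            covered.add(redundant_of[f])  # credit the underlying relevant feature
        # irrelevant features add nothing
    return len(covered)

# Example: two redundant copies of the same relevant feature count once.
relevant = {"x1", "x2"}
redundant_of = {"x1_copy_a": "x1", "x1_copy_b": "x1"}
print(count_correct_features(["x1_copy_a", "x1_copy_b", "x7"], relevant, redundant_of))  # -> 1
```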
As noted above, there is no single agreed “best” FSA, since different algorithms suit different data sizes, types, and applications [24]. One of the most popular and well-tested feature selection algorithms in the field is Minimum Redundancy Maximum Relevance (mRMR) [35]. mRMR selects features by computing which features correlate most with the target (relevance) and which features correlate least with each other (redundancy) [36]; both optimization criteria are combined to drive the selection. Other feature selection algorithms include decision tree entropy-based feature selection [37], Sequential Forward Selection (SFS) [38], and Sequential Backward Selection (SBS) [39].
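To illustrate the mRMR trade-off described above, the following is a minimal greedy sketch rather than the implementation evaluated later: it scores relevance with mutual information against the target and redundancy with the mean absolute Pearson correlation to the already selected features. The choice of these particular estimators is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def greedy_mrmr(X, y, k, random_state=0):
    """Greedily pick k columns of X maximizing (relevance - mean redundancy)."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=random_state)  # MI with the target
    corr = np.abs(np.corrcoef(X, rowvar=False))                       # |Pearson| between features
    selected, remaining = [], list(range(n_features))
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in remaining:
            redundancy = corr[j, selected].mean() if selected else 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected
```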
4.1. Feature Selection Algorithm Testing
For this work, we used the following list of FSAs:
Entropy—an algorithm that tries to maximize the information gained by selecting a certain feature, with entropy as the measure of impurity [40].
Gini Index (Gini)—similar to Entropy, this algorithm aims to maximize the information gain when selecting features but uses the Gini Index as its measure of impurity [41].
Mutual Information (MI)—this algorithm uses the concept of mutual information, which measures the reduction in uncertainty about the target variable given the selection of a certain feature [41].
Sequential Backward Selection (SBS)—this algorithm sequentially removes features until there is a significant increase in the misclassification rate of a supporting classifier. We used a Support Vector Machine with an RBF kernel for our experiments [42].
Sequential Forward Selection (SFS)—similar to SBS, but this algorithm sequentially adds features until there is no significant decrease in the misclassification rate of a supporting classifier. We used a Support Vector Machine with an RBF kernel for our experiments (see the sketch after this list) [42].
Symmetrical Uncertainty (SU)—uses the interaction of a feature with other features to determine which features are best [43].
Minimum Redundancy Maximum Relevance (mRMR)—a minimal-optimal feature selection algorithm that sequentially selects the feature with maximum relevance to the target variable and minimum redundancy with respect to the previously selected features [44].
Genetic Feature Selection Algorithm (GFA)—a genetic algorithm inspired by the concepts of evolution and natural selection, used from the sklearn-genetic library [45].
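As a brief illustration of how some of the listed selectors can be run in practice, the sketch below uses scikit-learn’s mutual information scorer and its SequentialFeatureSelector wrapped around an RBF-kernel SVM, mirroring the MI and SFS setups above. The stand-in dataset from make_classification and the value k = 4 are illustrative assumptions; in our experiments, the synthetic benchmark datasets and their known numbers of relevant features are used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
from sklearn.svm import SVC

# Stand-in dataset for illustration only.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
k = 4  # number of features requested (here: the number of relevant features)

# Mutual Information: rank features by MI with the target and keep the top k.
mi_scores = mutual_info_classif(X, y, random_state=0)
mi_selected = np.argsort(mi_scores)[::-1][:k]

# Sequential Forward Selection wrapped around an RBF-kernel SVM.
sfs = SequentialFeatureSelector(SVC(kernel="rbf"), n_features_to_select=k, direction="forward")
sfs.fit(X, y)
sfs_selected = sfs.get_support(indices=True)

print("MI top-k:", sorted(mi_selected))
print("SFS selection:", sorted(sfs_selected))
```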
In the next section, the selected FSAs are ranked and evaluated on their ability to identify the correct features in the generated datasets and also on their robustness, or resilience, to noise.
4.2. Complexity
When developing testing benchmarks for feature selection algorithms, it is crucial to understand how difficult or complex the datasets used are. While designing and generating our synthetic datasets, we developed a sense of how hard it would be for a feature selection algorithm to accurately pick out the relevant features. Using this information, we can measure the sophistication of feature selection algorithms based on how well they perform on datasets with different levels of complexity.
There is limited literature on a commonly agreed “difficulty” or “complexity” of synthetically generated datasets. Only a few papers, such as [46,47], have attempted to define the complexity of a dataset, and neither is regarded as the standard measure of complexity in the field. Hence, we propose complexity groupings based purely on empirical results. To do so, we tested eight different FSAs on all the datasets and recorded the percentage of correct features identified, where a correct feature is as defined at the beginning of this section. We used the distribution of the average percentage of identified correct features as the basis for grouping the datasets in terms of difficulty. A statistical test was applied to validate the groupings, where we expected the datasets in each group to exhibit similar performance across the different FSAs.
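The grouping step can be sketched as follows. The results table, the thresholds on the average percentage of correct features, and the use of a Kruskal–Wallis test are all assumptions made for illustration; the actual cut-offs and the validation test accompany the empirical results.

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical results table: rows = datasets, columns = FSAs,
# values = percentage of correct features identified (0-100).
results = pd.DataFrame(
    {"Entropy": [95, 90, 60, 30], "Gini": [100, 85, 55, 25],
     "MI": [90, 95, 65, 35], "mRMR": [100, 90, 70, 20]},
    index=["dataset_A", "dataset_B", "dataset_C", "dataset_D"])

# Group datasets by their average score across FSAs (thresholds are illustrative).
avg = results.mean(axis=1)
groups = pd.cut(avg, bins=[0, 40, 75, 100], labels=["hard", "medium", "easy"])
print(groups)

# Check that datasets placed in the same group behave similarly across FSAs,
# e.g., with a Kruskal-Wallis test on their per-FSA score distributions.
easy = [results.loc[d].values for d in groups[groups == "easy"].index]
if len(easy) > 1:
    stat, p = kruskal(*easy)
    print(f"Kruskal-Wallis within 'easy' group: H={stat:.2f}, p={p:.3f}")
```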
4.3. Noise Resilience
Noise is a common feature of most datasets. As a result, research has been conducted into the handling of noise in all kinds of datasets and applications [48,49,50]. In this work, we explored the robustness of feature selection algorithms with respect to two kinds of noise manipulation: asymmetric label noise and variation of the number of irrelevant features. Asymmetric label noise was created by selecting a random fraction of data points in the dataset and swapping their labels to another class using fixed rules; for example, a point selected from the ith class would be converted to the (i + 1)th class of that dataset. For the second kind of noise, we removed a fraction of the irrelevant features from the dataset and tracked the resulting change in FSA performance. Other methods of adding noise exist in the literature, such as the addition of Gaussian noise to the data samples, but we limited ourselves to two methods as this seemed sufficient for the scope of this work. Below is a summary of our noise manipulation methods:
Class Changing: Randomly changing the target variable class to a different class for a subset of the data.
Reducing the number of irrelevant features: Randomly dropping a subset of irrelevant variables from the generated data, i.e., reducing the level of noise in the dataset.
For each of the two methods, we varied the size of the affected subset, changing the percentage of instances whose class was altered (or the percentage of irrelevant features removed, respectively) between 0% and 50%. For class changing, the level of noise increases as this percentage moves from 0% to 50%; for irrelevant feature removal, the level of noise increases as the percentage of removed irrelevant features decreases from 50% to 0%.
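The two manipulations can be sketched as follows; the function and column names are hypothetical, and the dataset is assumed to be a pandas DataFrame with an integer-coded "target" column and a known list of irrelevant columns.

```python
import numpy as np
import pandas as pd

def add_asymmetric_label_noise(df, frac, n_classes, target="target", seed=0):
    """Flip the label of a random fraction of rows from class i to class (i + 1) mod n_classes."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    idx = rng.choice(noisy.index, size=int(frac * len(noisy)), replace=False)
    noisy.loc[idx, target] = (noisy.loc[idx, target] + 1) % n_classes
    return noisy

def drop_irrelevant_features(df, irrelevant_cols, frac, seed=0):
    """Randomly drop a fraction of the known irrelevant columns (reducing the noise level)."""
    rng = np.random.default_rng(seed)
    n_drop = int(frac * len(irrelevant_cols))
    to_drop = rng.choice(irrelevant_cols, size=n_drop, replace=False)
    return df.drop(columns=list(to_drop))

# Example usage: 20% asymmetric label noise, then removal of half of the irrelevant columns.
# noisy = add_asymmetric_label_noise(df, frac=0.2, n_classes=3)
# reduced = drop_irrelevant_features(noisy, irrelevant_cols, frac=0.5)
```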