1. Introduction
In the last decade, there has been growing scientific interest in the analysis of DNA microarray datasets, involving different research fields such as biology, bioinformatics, and machine learning. DNA microarrays contain measurements from tests that quantify the types and amounts of Messenger Ribonucleic Acid (mRNA) transcripts present in a collection of cells, thus allowing researchers to characterize tissue and cell samples in terms of gene expression differences. These data have been widely used in basic and translational cancer research. In this context, DNA microarrays were used either to separate healthy patients from oncological ones, based on their “gene expression” profile (two-class problems), or to distinguish between different types of cancer (multiclass problems).
DNA microarray experiments typically generate a very large number of features: for each patient, each feature represents a gene expression coefficient corresponding to the abundance of mRNA in a given sample [
1]. The effect is that the number of features in the raw data ranges from 6000 to 60,000, while the available datasets usually refer to fewer than 100 patients [
2]. This implies that the classification task is very challenging, due to both the high number of gene expression values and the small number of samples [
3], and requires an accurate feature-selection process to reduce the complexity of the feature space and to identify a subset of highly distinctive genes [
4,
5]. It has been demonstrated, in fact, that most of the genes measured in DNA microarray experiments are not relevant to improve classification accuracy for many of the classes to be recognized [
6]. The application of a feature-selection process, however, is motivated not only by the need to eliminate redundant or non-distinctive features, but also by the aim of more accurately correlating gene expression to diseases. This last aspect is very important for biologists and explains the scientific interest in the development of new feature-selection algorithms, as evidenced by the large number of works published in the literature in recent years. Nevertheless, it has been highlighted that in this framework there are no standard state-of-the-art results generally accepted by the scientific community: therefore, it is difficult to compare the effectiveness of new algorithms and decide which approach can provide better results in the general case. The problem is even more complex when considering the large number of available microarray datasets, whose properties, in terms of both the number of features and the number of patients, can vary significantly [
2].
Based on these considerations, the aim of the present work is to provide a broad experimental comparison on the feature-selection and classification techniques applied to different DNA microarray datasets. Unlike other works that have analyzed these aspects separately [
2,
7], our aim is to study the combined effects of the feature-selection process on the classification results, and to evaluate the sensitivity of the considered classification methods with respect to the size of the feature space. Moreover, taking into account the complexity of the feature-selection problem in the case of thousands of features, we focused our analysis on standard feature-ranking techniques and evaluated the classification methods on feature sets of increasing size, obtained by selecting the features according to the order in which they appear in the ranking [
8]. For the sake of completeness, we have also compared our results with those obtained by using three state-of-the-art feature-selection methods.
In the experiments, we considered six DNA microarray datasets with different properties in terms of both the number of features and the number of patients. As for the classification process, we chose four classification methods, namely decision trees, random forest, nearest neighbor and multilayer perceptron, thus providing the interested reader with a broad overview of the results obtainable with the most efficient and widely used classification schemes.
The remainder of the paper is organized as follows:
Section 2 briefly illustrates the characteristics of the considered DNA microarray datasets,
Section 3 discusses the feature-selection methods, while
Section 4 presents the experimental results. Some conclusions are eventually left to
Section 5.
3. Feature Selection
Feature selection is the process of reducing the dimensionality of the available data, with the aim of improving the recognition results. This process typically consists of three elements: a search procedure for exploring the solution space, made up of all the possible feature subsets; an evaluation function; and a stopping criterion.
The main difficulty of feature selection is due to the dimension of the search space, which grows exponentially with the number of available features: for data represented by N features, there are 2^N - 1 feasible solutions, i.e., non-empty feature subsets (for N = 6000, more than 10^1800 subsets). This makes the exhaustive search strategy computationally intractable [
15]. This is typically the case for microarray datasets in which, as previously evidenced,
N is of the order of thousands. To overcome this problem, heuristic algorithms are typically used for finding near-optimal solutions.
As for the evaluation functions, they are generally subdivided into three wide categories, namely filter, wrapper and embedded methods [
16,
17]. Filter methods measure statistical or geometrical properties of the subset to be evaluated, whereas wrapper functions adopt as evaluation measure the accuracy achieved by a given, previously chosen, classifier. Finally, embedded approaches include feature selection in the training process, thus reducing the computational costs due to the classification process needed for each subset.
As just mentioned, wrapper methods are computationally costly, because the evaluation of each subset requires the training of the adopted classifier. For this reason, they are typically used with near-optimal search strategies, which can achieve acceptable results while limiting the computational costs. Filter methods, instead, need non-iterative computations on the dataset, which are, in most cases, significantly faster than classifier training sessions. Moreover, since filter approaches estimate intrinsic properties of the data, they provide more general solutions, typically performing well with a wide family of classifiers.
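To make this distinction concrete, the following minimal sketch scores the same candidate feature subset with a filter measure, computed from the data alone, and with a wrapper measure, obtained by repeatedly training a previously chosen classifier. The synthetic data, the mutual-information score, and the 1-NN classifier are illustrative assumptions, not choices taken from this paper (whose experiments use WEKA):

```python
# Hedged sketch contrasting a filter and a wrapper evaluation function on the
# same candidate feature subset. Data, score, and classifier are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=200, n_informative=10,
                           random_state=0)
subset = [0, 1, 2, 3, 4]  # a candidate feature subset to be evaluated

# Filter: estimate the merit of the subset from data statistics alone,
# without training any classifier.
filter_score = mutual_info_classif(X[:, subset], y, random_state=0).mean()

# Wrapper: estimate the merit of the same subset as the cross-validated
# accuracy of a previously chosen classifier (training is required).
wrapper_score = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                X[:, subset], y, cv=5).mean()
print(f"filter score: {filter_score:.3f}, wrapper score: {wrapper_score:.3f}")
```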
A particular category of filter methods is that of the ranking ones, which evaluate each feature individually. Once all features have been evaluated, they are ranked according to their merit. The subset search step is then straightforward: the best M features are selected, with M set by the user. Unfortunately, even if this approach is very fast and allows dealing with thousands of features, there is no general criterion for choosing the dimension of the feature space, so it is difficult to select the number M of features. Moreover, and most importantly, relevant features that are highly informative when combined with other ones could be discarded because they are weakly correlated with the target class.
In this study, because of the huge dimensionality of the datasets used for the experiments, we considered only standard ranking algorithms to build up an experimental protocol for selecting the feature subsets that allowed us to achieve the best classification results. The adopted protocol did not use any search strategy, but added the features progressively, according to their position in the ranking. As for the ranking, we considered five standard univariate measures, namely Chi-square, Relief, Gain Ratio, Information Gain, and Symmetrical Uncertainty. These measures, as well as the state-of-the-art feature-selection approaches considered for the comparison, are detailed in the following subsections.
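The fragment below gives a minimal sketch of this protocol under illustrative assumptions (synthetic data, the Chi-square score as the univariate measure, and random forest as the classifier): the features are ranked once, and the classifier is then evaluated on feature sets of increasing size taken from the top of the ranking.

```python
# Hedged sketch of the adopted protocol: rank all features once with a
# univariate measure, then evaluate a classifier on feature sets of
# increasing size taken from the top of the ranking. Synthetic data, the
# Chi-square score, and random forest are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=500, n_informative=15,
                           random_state=0)
X = X - X.min(axis=0)               # chi2 requires non-negative feature values

scores, _ = chi2(X, y)              # univariate merit of each feature
ranking = np.argsort(scores)[::-1]  # feature indices, best first

for m in (5, 10, 20, 50, 100):      # feature sets of increasing size
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X[:, ranking[:m]], y, cv=10).mean()
    print(f"top {m:3d} features: accuracy = {acc:.3f}")
```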
4. Experimental Results
As anticipated in the Introduction, the effectiveness of the feature subsets obtained by applying the selection techniques described in
Section 3 has been evaluated by using four different classification schemes, namely decision trees, random forest, nearest neighbor, and multilayer perceptron. For each of them, we performed a set of experiments using the microarray datasets described in
Section 2, following the same experimental protocol. The adopted classification schemes, the experimental protocol, and the sets of experiments are described in the following subsections.
4.1. The Classification Schemes
As for the classifiers considered in this study, we used the implementation provided by the WEKA tool [
25], choosing a 10-fold cross-validation strategy. Furthermore, to manage the randomness of the classification algorithms, we performed 20 runs for each experiment. All the results reported below were obtained by averaging the values over the 20 runs.
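This evaluation setup can be sketched as follows; the paper relies on WEKA, so the scikit-learn code below, with synthetic data and a decision tree as a placeholder classifier, only illustrates the 20 runs of 10-fold cross-validation and the averaging of their accuracies.

```python
# Hedged sketch of the evaluation setup: 10-fold cross validation repeated
# over 20 runs with different seeds, averaging the resulting accuracies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

run_means = []
for seed in range(20):  # 20 runs to average out algorithmic randomness
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed)
    run_means.append(cross_val_score(clf, X, y, cv=cv).mean())

print(f"mean accuracy over 20 runs: {np.mean(run_means):.3f}")
```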
A decision tree (DT) is a classification method that uses a knowledge representation based on a tree structure. In such a structure, the internal nodes represent specific tests on the features used to represent patterns, while the branches coming out of a node represent the possible results of these tests. Finally, the leaf nodes represent the class assigned to an input sample based on the results of all the tests performed along a path from the root node to the leaf node. In this study, we considered the C4.5 learning algorithm, which builds the decision tree with a top-down approach, using the concept of information entropy. In practice, given a training set T, it breaks T down into smaller subsets by gradually increasing the depth of the tree, until a leaf node is reached, at which point the process terminates. In each node of the tree, C4.5 chooses the feature that most effectively splits the corresponding sample subset, using the normalized entropy gain as a splitting criterion: this criterion measures how homogeneous the obtained subsets are, in terms of class labels.
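The normalized entropy gain (gain ratio) can be sketched as follows for a single candidate binary split; the function names, the toy data, and the threshold are ours, used only to illustrate the normalization of the information gain by the entropy of the split itself.

```python
# Hedged sketch of C4.5's splitting criterion: the gain ratio, i.e., the
# information gain normalized by the entropy of the partition it produces.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels, threshold):
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n, nl = len(labels), len(left)
    # Information gain: entropy reduction produced by the candidate split.
    gain = (entropy(labels) - (nl / n) * entropy(left)
            - ((n - nl) / n) * entropy(right))
    # Split information: entropy of the partition sizes, used to normalize
    # the gain and penalize splits producing many small subsets.
    split_info = entropy(np.array([0] * nl + [1] * (n - nl)))
    return gain / split_info if split_info > 0 else 0.0

feature = np.array([0.2, 0.5, 0.8, 1.1, 1.4, 1.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(gain_ratio(feature, labels, threshold=0.9))  # pure split -> ratio 1.0
```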
The term
random forest (RF) does not refer to a single algorithm, but rather to a family of methods for creating an ensemble of tree-based classifiers. The original algorithm proposed by Breiman in [
26] is usually denoted
Forest-RI in the literature, and it is considered a reference method in most of the papers dealing with RF. Given a training set that contains n feature vectors, each consisting of N features, the Forest-RI algorithm is composed of the following steps, applied to each tree:
Randomly extract n samples with replacement from the dataset. The obtained set of samples will be used as the training set for the starting node of the tree.
Set a number K ≪ N.
At each node, randomly select K features from the whole set of available ones.
For each of the selected features, consider its values in the training set and choose the best binary split value according to the Gini index [
27]. Then, choose the feature with the best index value and generate two new nodes by splitting the samples associated with the original node according to such a value.
Grow the tree to its maximum depth, according to the defined stopping criterion. Note that node splitting is usually stopped when one of the following conditions occurs:
(i) The number of samples in the node to be split is below a given threshold;
(ii) all the samples in the node belong to the same class.
Leave the tree unpruned.
Once the forest has been built, an unknown sample is labeled according to the Majority Vote rule, i.e., it is assigned the most popular class among those provided by the trees of the ensemble.
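As an illustration, the steps above map naturally onto the parameters of scikit-learn's RandomForestClassifier; this is an assumption for illustration only, since the experiments in this paper were run with WEKA, and scikit-learn aggregates the trees' class probabilities rather than applying a strict majority vote.

```python
# Hedged sketch mapping the Forest-RI steps onto scikit-learn's random
# forest; data and hyper-parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=200, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # step 1: per-tree sampling with replacement
    max_features="sqrt",   # steps 2-3: K features tried at each node
    criterion="gini",      # step 4: best binary split by the Gini index
    max_depth=None,        # step 5: grow until a stopping condition holds
    min_samples_split=2,   # node-size threshold (pure nodes also stop)
    random_state=0,
)                          # step 6: scikit-learn trees are not pruned
rf.fit(X, y)
# Prediction aggregates the trees' outputs; scikit-learn averages class
# probabilities instead of taking a strict majority vote.
print(rf.predict(X[:5]))
```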
The K-Nearest Neighbor algorithm (K-NN) is a well-known non-parametric classification technique, according to which an unknown sample is assigned the most frequent class among those of its k nearest samples in the training set. The rationale behind this technique is that, given an unknown sample x to be assigned to one of the classes of the considered application field, the a-posteriori probabilities in the neighborhood of x may be estimated by taking into account the class labels of the k nearest neighbors of x. Despite its simplicity, K-NN proved able to provide high performance. The results shown below were obtained by using the Mahalanobis distance, which proved to be more effective than the Euclidean distance in a set of preliminary experiments. As regards the value of the parameter k, we found that the value k = 1 significantly outperformed higher values. This result suggests that, for these kinds of data, higher values of k force the algorithm to consider far neighbors belonging to classes different from the actual class of the sample at hand. The effect is that of assigning several samples to a wrong class, reducing the overall recognition rate. This behavior could also be a consequence of the limited number of specimens available in the training set, as well as a specificity of the considered DNA microarray datasets.
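A minimal sketch of this configuration, 1-NN with the Mahalanobis distance, is shown below; the synthetic data and the covariance estimate are illustrative assumptions, introduced because the metric requires the inverse covariance matrix of the data.

```python
# Hedged sketch of 1-NN with the Mahalanobis distance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# The Mahalanobis metric needs the inverse covariance matrix of the data
# (n_redundant=0 keeps the covariance matrix non-singular here).
VI = np.linalg.inv(np.cov(X, rowvar=False))
knn = KNeighborsClassifier(n_neighbors=1, metric="mahalanobis",
                           metric_params={"VI": VI})
print(cross_val_score(knn, X, y, cv=10).mean())
```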
An artificial Neural Network (NN in the following) is a well-known information processing paradigm inspired by the way biological nervous systems process information. These networks typically consist of many highly interconnected processing elements (neurons), often equipped with a local memory, that can process information locally, providing as output a single unidirectional signal transmitted to all the connected neurons. An NN can be configured to solve specific problems, such as pattern recognition or data classification, through a learning process, which implies the adjustment of the information locally stored in each neuron. In this study, we considered the multilayer perceptron network with a feed-forward, fully connected, three-layer architecture. The training was performed by using the back-propagation algorithm, which is one of the most popular training methods, successfully applied to a large variety of real-world classification tasks.
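A minimal sketch of such a network is given below; the hidden-layer size, the plain stochastic-gradient-descent solver, and the input scaling are illustrative choices, not hyper-parameters reported in this paper.

```python
# Hedged sketch of a three-layer, fully connected, feed-forward network:
# one hidden layer trained with back-propagation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

mlp = make_pipeline(
    StandardScaler(),                        # scaling helps back-propagation
    MLPClassifier(hidden_layer_sizes=(30,),  # single hidden layer of 30 neurons
                  solver="sgd",              # gradient-descent back-propagation
                  max_iter=2000,
                  random_state=0),
)
print(cross_val_score(mlp, X, y, cv=10).mean())
```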
5. Conclusions and Future Works
In this study, we performed a wide experimental comparison of the combined effect of feature-selection and classification techniques applied to different DNA microarray datasets, paying particular attention to how much the number of selected features affects the obtainable classification performance.
Since DNA microarray experiments typically generate a very large number of features (on the order of thousands) for a limited number of patients (on the order of hundreds or fewer), the classification task is very complex and typically requires the application of a feature-selection process to reduce the complexity of the feature space and to identify a subset of distinctive features.
In our experiments, we considered six microarray datasets, namely Breast, Colon, Leukemia, Lymphoma, Lung and Ovarian, and four classification schemes, namely decision trees, random forest, nearest neighbor, and multilayer perceptron. Such datasets represent an effective test bed, since they exhibit a large variability as regards the number of attributes, the number of classes (two or multiple classes problems) and the number of samples. The considered classifiers provide a broad overview of the results obtainable with the most efficient and widely used classification schemes.
As for the feature-selection process, taking into account the complexity of this problem in the case of thousands of features, we focused our analysis on standard feature ranking techniques, which evaluate each feature individually, estimating intrinsic properties of the data. These techniques typically provide general solutions, performing well on a wide family of classifiers, even if they could discard relevant features that are weakly correlated with the target class but highly informative when combined with other features. The performance of the classification methods was evaluated by using feature sets of increasing size, obtained by selecting the features according to their position in the ranking. Despite this simplified choice, we obtained very interesting results, even in comparison with those produced by the three state-of-the-art feature-selection methods considered in this study, namely the Sequential Forward Floating Search, the Fast Correlation-Based Filter and the Minimum Redundancy Maximum Relevance.
The results confirmed that the feature-selection process plays a key role in this application field and that a reduced feature set allowed us to significantly improve classification performance. They also showed that by accepting a slight reduction of the classification rate, it is possible to further reduce the feature set, significantly improving the computational efficiency of the whole classification system.
As future work, we would like to improve the classification system by using better-performing strategies, such as those based on classifier ensembles [
28,
29,
30]. The use of Learning Vector Quantization networks could also provide interesting results, as they are generally considered a powerful pattern recognition tool and offer the advantage of providing an explicit representation of the prototypes of each class [
31] (in this case, a prototype represents the specific gene expression values characterizing tissue and cell samples).