RNA Sequences-Based Diagnosis of Parkinson’s Disease Using Various Feature Selection Methods and Machine Learning
Abstract
1. Introduction
- To verify the effectiveness of feature selection when classifying disease status from RNA-seq data using machine learning.
- To investigate how different feature selection methods affect machine learning performance.
- To compare machine learning performance across feature selection methods (genetic algorithm, information gain, and wolf search algorithm).
2. Background
2.1. RNA Sequencing (RNA-seq)
2.2. Dataset Used for This Study
2.3. Feature Selection
2.4. Genetic Algorithm (GA)
2.5. Information Gain (IG)
2.6. Wolf Search Algorithm (WSA)
- Each wolf has a fixed visual radius, which defines a circular coverage area in 2D. A wolf can detect any companion appearing within this coverage, and the distance it can move in a single step is smaller than the radius.
- The quality of a wolf's position is measured by the objective function. Wolves try to move to better terrain; if one or more better positions are visible, the wolf chooses the best of them to inhabit.
- Wolves can sense enemies and escape from threats to a random position beyond their visual radius. (A minimal code sketch of these movement rules follows this list.)
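For illustration, the following is a minimal sketch of these movement rules on a continuous search space; it is not the paper's implementation. The visual radius `r`, step size, and escape probability `pa` are generic WSA parameters, and the sphere objective is a placeholder.

```python
import numpy as np

def wolf_search(objective, dim=2, n_wolves=20, r=1.0, step=0.5,
                pa=0.25, iters=100, seed=0):
    """Minimal Wolf Search Algorithm sketch (minimization)."""
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(-5, 5, size=(n_wolves, dim))  # initial positions
    fitness = np.array([objective(w) for w in wolves])

    for _ in range(iters):
        for i in range(n_wolves):
            # Companions visible within the fixed visual radius r
            dist = np.linalg.norm(wolves - wolves[i], axis=1)
            visible = (dist > 0) & (dist <= r)
            better = visible & (fitness < fitness[i])
            if better.any():
                # Move toward the best visible companion (better terrain)
                target = wolves[np.argmin(np.where(better, fitness, np.inf))]
                cand = wolves[i] + step * (target - wolves[i]) * rng.random()
            else:
                # Passive local search: random step smaller than the radius
                cand = wolves[i] + step * rng.uniform(-1, 1, dim)
            if objective(cand) < fitness[i]:
                wolves[i], fitness[i] = cand, objective(cand)
            # Threat sensed with probability pa: escape to a random
            # position beyond the visual radius
            if rng.random() < pa:
                direction = rng.uniform(-1, 1, dim)
                direction /= np.linalg.norm(direction)
                wolves[i] = wolves[i] + direction * (r + rng.random() * step)
                fitness[i] = objective(wolves[i])

    best = np.argmin(fitness)
    return wolves[best], fitness[best]

# Example: minimize a simple sphere function
best_x, best_f = wolf_search(lambda x: float(np.sum(x**2)))
print(best_x, best_f)
```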
2.7. Machine Learning Algorithms for Classification
3. Method
3.1. Feature Selection Using Genetic Algorithms
3.2. Feature Selection Using Information Gain
3.3. Feature Selection Using Wolf Search Algorithm
3.4. Machine Learning Model Implementation
3.5. Model Evaluation
4. Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
GA1 | Genetic algorithm using Equation (5)
GA2 | Genetic algorithm using Equation (6)
IG | Information gain
WSA | Wolf search algorithm
XGBoost | Extreme gradient boosting
DNN | Deep neural network
SVM | Support vector machine
DT | Decision tree
Appendix A. Comparison of the Performances of GA1 and GA2 with Various Numbers of Features
Method | Number of Features | Model | Accuracy (%) | Std. | Precision | Recall | F1-Score | Time Cost (s)
---|---|---|---|---|---|---|---|---
GA1 | 25 | XGBoost | 62.73 | 5.30 | 64.87 | 61.31 | 60.13 | 1.13
GA1 | 25 | DNN | 67.27 | 5.30 | 68.68 | 66.00 | 65.64 | 4.72
GA1 | 25 | SVM | 70.91 | 4.64 | 71.56 | 69.47 | 69.26 | 0.66
GA1 | 25 | DT | 56.36 | 8.43 | 52.84 | 54.02 | 51.16 | 0.06
GA1 | 50 | XGBoost | 68.18 | 5.75 | 67.95 | 67.66 | 67.42 | 0.94
GA1 | 50 | DNN | 78.18 | 5.30 | 81.81 | 77.50 | 77.20 | 5.04
GA1 | 50 | SVM | 66.36 | 8.43 | 66.36 | 66.49 | 65.96 | 0.06
GA1 | 50 | DT | 60.91 | 6.17 | 62.82 | 60.84 | 59.85 | 0.07
GA1 | 75 | XGBoost | 79.09 | 9.36 | 82.27 | 77.37 | 77.74 | 0.90
GA1 | 75 | DNN | 75.45 | 6.17 | 78.06 | 74.17 | 73.95 | 4.65
GA1 | 75 | SVM | 79.09 | 10.60 | 79.69 | 78.92 | 78.64 | 0.07
GA1 | 75 | DT | 69.09 | 7.82 | 68.68 | 68.12 | 67.77 | 0.09
GA1 | 100 | XGBoost | 80.91 | 8.81 | 80.94 | 80.53 | 81.36 | 1.02
GA1 | 100 | DNN | 85.45 | 5.30 | 86.28 | 85.00 | 85.24 | 10.84
GA1 | 100 | SVM | 71.82 | 6.68 | 72.39 | 71.37 | 71.05 | 0.08
GA1 | 100 | DT | 62.73 | 7.82 | 62.74 | 62.14 | 61.83 | 0.12
GA2 | 25 | XGBoost | 70.91 | 3.64 | 72.23 | 70.06 | 69.66 | 1.03
GA2 | 25 | DNN | 61.82 | 3.64 | 79.46 | 58.00 | 50.46 | 4.45
GA2 | 25 | SVM | 67.27 | 5.30 | 68.68 | 67.29 | 65.90 | 1.00
GA2 | 25 | DT | 60.00 | 6.03 | 60.15 | 59.75 | 58.84 | 0.05
GA2 | 50 | XGBoost | 74.55 | 6.17 | 76.18 | 74.56 | 73.73 | 1.12
GA2 | 50 | DNN | 71.82 | 7.82 | 72.61 | 72.17 | 71.60 | 4.38
GA2 | 50 | SVM | 73.64 | 12.33 | 74.67 | 73.68 | 72.63 | 0.35
GA2 | 50 | DT | 68.18 | 10.76 | 69.33 | 68.70 | 67.66 | 0.08
GA2 | 75 | XGBoost | 69.09 | 9.27 | 73.79 | 68.91 | 66.56 | 0.99
GA2 | 75 | DNN | 84.55 | 7.39 | 86.69 | 83.67 | 83.72 | 4.86
GA2 | 75 | SVM | 70.91 | 5.45 | 72.00 | 70.68 | 70.18 | 0.07
GA2 | 75 | DT | 59.09 | 6.43 | 60.22 | 60.08 | 56.84 | 0.09
GA2 | 100 | XGBoost | 72.73 | 4.07 | 73.94 | 72.93 | 72.26 | 0.97
GA2 | 100 | DNN | 82.73 | 5.30 | 84.33 | 82.17 | 82.30 | 5.68
GA2 | 100 | SVM | 69.09 | 4.45 | 69.17 | 69.20 | 68.60 | 0.09
GA2 | 100 | DT | 63.64 | 4.98 | 61.93 | 62.05 | 60.89 | 0.10
Analysis | Description | Control | PD | t-Test p
---|---|---|---|---
RNA-seq | Number of samples | 44 | 29 | -
RNA-seq | Age at death, years (range) | 70.00 (46–97) | 77.55 (64–95) | 4.6 × 10⁻³
RNA-seq | PMI (post-mortem interval), hours (range) | 14.36 (2–32) | 11.14 (1–31) | 1.7 × 10⁻¹
RNA-seq | RIN (RNA integrity number), range | 7.85 (6.0–9.1) | 7.07 (5.8–8.5) | 5.9 × 10⁻⁵
Model | Parameter | Explanation | Value
---|---|---|---
XGBoost | n_estimators | Number of gradient-boosted trees | 100
XGBoost | learning_rate | Step-size shrinkage applied at each boosting iteration | 0.1
XGBoost | gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree | 0
XGBoost | max_depth | Maximum tree depth for base learners | 6
XGBoost | objective | Learning objective (binary classification) | binary
SVM | kernel | Kernel type used by the classifier | linear
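As a minimal sketch, these settings map directly onto the Python `xgboost` and scikit-learn APIs. The paper does not state its exact implementation, and spelling out the `binary` objective as xgboost's `binary:logistic` is an assumption.

```python
from sklearn.svm import SVC
from xgboost import XGBClassifier

# XGBoost with the hyperparameters listed above; "binary" is assumed
# to mean xgboost's built-in binary:logistic objective.
xgb_model = XGBClassifier(
    n_estimators=100,   # number of gradient-boosted trees
    learning_rate=0.1,  # step-size shrinkage per boosting iteration
    gamma=0,            # minimum loss reduction to split a leaf further
    max_depth=6,        # maximum depth of each base learner
    objective="binary:logistic",
)

# Linear-kernel SVM as listed above; other SVC settings left at defaults.
svm_model = SVC(kernel="linear")
```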
Layer | Output Shape | Param #
---|---|---
Input | 51, 100 | -
Linear-1 | 51, 128 | 12,928
Dropout-2 | 51, 128 | 0
Linear-3 | 51, 128 | 16,512
Linear-4 | 51, 64 | 8,256
Linear-5 | 51, 64 | 4,160
Linear-6 | 51, 32 | 2,080
Linear-7 | 51, 32 | 1,056
Linear-8 | 51, 16 | 528
Linear-9 | 51, 1 | 17
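The listing above corresponds to a stack of fully connected layers whose parameter counts follow from in_features × out_features + out_features (e.g., 100 × 128 + 128 = 12,928; the Linear-6 entry is corrected to 2,080 = 64 × 32 + 32 accordingly). Below is a hedged PyTorch reconstruction: only the Linear/Dropout modules are documented, so the ReLU placement and the dropout rate are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class PDClassifier(nn.Module):
    """Hedged reconstruction of the layer table; ReLU placement and the
    dropout rate are assumptions (only Linear/Dropout appear in the
    table, consistent with activations applied functionally)."""

    def __init__(self, n_features: int = 100, p_drop: float = 0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 128)  # Linear-1: 100*128 + 128 = 12,928
        self.drop = nn.Dropout(p_drop)         # Dropout-2: 0 params
        self.fc2 = nn.Linear(128, 128)         # Linear-3: 16,512
        self.fc3 = nn.Linear(128, 64)          # Linear-4: 8,256
        self.fc4 = nn.Linear(64, 64)           # Linear-5: 4,160
        self.fc5 = nn.Linear(64, 32)           # Linear-6: 2,080
        self.fc6 = nn.Linear(32, 32)           # Linear-7: 1,056
        self.fc7 = nn.Linear(32, 16)           # Linear-8: 528
        self.out = nn.Linear(16, 1)            # Linear-9: 17

    def forward(self, x):
        x = self.drop(F.relu(self.fc1(x)))
        for fc in (self.fc2, self.fc3, self.fc4, self.fc5, self.fc6, self.fc7):
            x = F.relu(fc(x))
        return self.out(x)  # logits; apply sigmoid for PD probability

x = torch.randn(51, 100)                  # 51 samples, 100 selected features
probs = torch.sigmoid(PDClassifier()(x))  # shape (51, 1)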
Feature Selection Method | Number of Features | Model | Fold | Accuracy (%) | Precision | Recall | F1-Score | p-Value (vs. Non-Feature Selection)
---|---|---|---|---|---|---|---|---
IG | 50 | XGBoost | 1 | 86.36 | 89.28 | 86.36 | 86.10 | 2.57 × 10⁻⁴
IG | 50 | XGBoost | 2 | 90.90 | 90.83 | 90.83 | 90.83 |
IG | 50 | XGBoost | 3 | 95.45 | 95.45 | 95.83 | 95.44 |
IG | 50 | XGBoost | 4 | 81.81 | 83.03 | 80.83 | 81.19 |
IG | 50 | XGBoost | 5 | 81.81 | 80.35 | 80.35 | 80.35 |
GA1 | 100 | DNN | 1 | 90.90 | 92.85 | 90.00 | 90.59 | 3.14 × 10⁻¹³
GA1 | 100 | DNN | 2 | 81.81 | 81.66 | 81.66 | 81.66 |
GA1 | 100 | DNN | 3 | 86.36 | 86.75 | 85.83 | 86.10 |
GA1 | 100 | DNN | 4 | 90.90 | 92.85 | 90.00 | 90.59 |
GA1 | 100 | DNN | 5 | 77.27 | 77.27 | 77.50 | 77.22 |
IG | 50 | SVM | 1 | 72.72 | 79.46 | 77.27 | 76.84 | 4.38 × 10⁻¹
IG | 50 | SVM | 2 | 90.90 | 90.83 | 90.83 | 90.83 |
IG | 50 | SVM | 3 | 72.72 | 73.33 | 73.33 | 72.72 |
IG | 50 | SVM | 4 | 72.72 | 79.52 | 75.83 | 76.03 |
IG | 50 | SVM | 5 | 90.90 | 90.17 | 90.17 | 90.17 |
WSA | 73 | DT | 1 | 68.18 | 68.33 | 68.18 | 68.11 | 2.41 × 10⁻⁴
WSA | 73 | DT | 2 | 95.45 | 95.45 | 95.83 | 95.44 |
WSA | 73 | DT | 3 | 77.27 | 77.27 | 77.50 | 77.22 |
WSA | 73 | DT | 4 | 81.81 | 83.03 | 80.83 | 81.19 |
WSA | 73 | DT | 5 | 63.63 | 62.50 | 63.39 | 62.39 |
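The table above implies a 5-fold cross-validation protocol with per-fold accuracy, precision, recall, and F1, plus a t-test comparing scores with and without feature selection. The sketch below shows one way to compute such numbers; the stratified splitting, macro averaging, paired test variant, and the names `X_selected`/`X_full` are assumptions, not the paper's confirmed implementation.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold

def cross_validate(model, X, y, n_splits=5, seed=0):
    """Per-fold accuracy/precision/recall/F1 (macro averaging assumed)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append([
            accuracy_score(y[test_idx], pred),
            precision_score(y[test_idx], pred, average="macro"),
            recall_score(y[test_idx], pred, average="macro"),
            f1_score(y[test_idx], pred, average="macro"),
        ])
    return np.array(scores)  # shape (n_splits, 4)

# Paired t-test on per-fold accuracies, with vs. without feature selection.
# X_selected (columns chosen by GA/IG/WSA) and X_full are hypothetical names.
# acc_fs = cross_validate(model, X_selected, y)[:, 0]
# acc_all = cross_validate(model, X_full, y)[:, 0]
# _, p_value = ttest_rel(acc_fs, acc_all)
```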
Method | Number of Features | Model | Accuracy (%) | Precision | Recall | F1-Score | Time Cost (s)
---|---|---|---|---|---|---|---
Non-feature selection | 17,850 | XGBoost | 77.27 | 76.60 | 76.44 | 77.03 | 26.45
Non-feature selection | 17,850 | DNN | 47.27 | 50.00 | 32.06 | 23.64 | 16.74
Non-feature selection | 17,850 | SVM | 80.91 | 80.20 | 80.43 | 82.52 | 38.28
Non-feature selection | 17,850 | DT | 70.00 | 68.62 | 68.59 | 70.51 | 11.44
GA1 | 100 | XGBoost | 80.91 | 80.94 | 80.53 | 81.36 | 1.02
GA1 | 100 | DNN | 85.45 | 86.28 | 85.00 | 85.24 | 10.84
GA1 | 75 | SVM | 79.09 | 79.69 | 78.92 | 78.64 | 0.07
GA1 | 75 | DT | 69.09 | 68.68 | 68.12 | 67.77 | 0.09
GA2 | 50 | XGBoost | 74.55 | 76.18 | 74.56 | 73.73 | 1.12
GA2 | 75 | DNN | 84.55 | 86.69 | 83.67 | 83.72 | 4.86
GA2 | 50 | SVM | 73.64 | 74.67 | 73.68 | 72.63 | 0.35
GA2 | 50 | DT | 68.18 | 69.33 | 68.70 | 67.66 | 0.08
IG | 50 | XGBoost | 87.27 | 87.79 | 86.84 | 86.79 | 0.98
IG | 100 | DNN | 81.82 | 83.96 | 80.83 | 81.16 | 6.56
IG | 50 | SVM | 81.82 | 82.67 | 81.49 | 81.32 | 0.06
IG | 50 | DT | 77.27 | 78.90 | 77.37 | 76.43 | 0.08
WSA | 73 | XGBoost | 86.36 | 89.04 | 84.95 | 85.46 | 0.77
WSA | 73 | DNN | 85.45 | 87.18 | 84.50 | 84.89 | 4.61
WSA | 8,358 | SVM | 80.00 | 81.55 | 79.10 | 79.33 | 5.85
WSA | 73 | DT | 77.27 | 77.32 | 77.15 | 76.88 | 0.09
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).