Article

Stability of Feature Selection in Multi-Omics Data Analysis

by Tomasz Łukaszuk 1, Jerzy Krawczuk 1, Kamil Żyła 2 and Jacek Kęsik 2,*
1 Faculty of Computer Science, Bialystok University of Technology, Wiejska 45A, 15-351 Bialystok, Poland
2 Department of Computer Science, Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, Nadbystrzycka 36B, 20-618 Lublin, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(23), 11103; https://doi.org/10.3390/app142311103
Submission received: 8 November 2024 / Revised: 26 November 2024 / Accepted: 27 November 2024 / Published: 28 November 2024

Abstract

In the rapidly evolving field of multi-omics data analysis, understanding the stability of feature selection is critical for reliable biomarker discovery and clinical applications. This study investigates the stability of feature-selection methods across various cancer types by utilizing 15 datasets from The Cancer Genome Atlas (TCGA). We employed classifiers with embedded feature selection, including Support Vector Machines (SVM), Logistic Regression (LR), and Lasso regression, each incorporating L1 regularization. Through a comprehensive evaluation using five-fold cross-validation, we measured feature-selection stability and assessed the accuracy of predictions regarding TP53 mutations, a known indicator of poor clinical outcomes in cancer patients. All three classifiers demonstrated optimal feature-selection stability, measured by the Nogueira metric, with higher regularization (fewer selected features), while lower regularization generally resulted in decreased stability across all omics layers. Our findings indicate differences in feature stability across the various omics layers; mirna consistently exhibited the highest stability across classifiers, while the mutation and rna layers were generally less stable, particularly with lower regularization. This work highlights the importance of careful feature selection and validation in high-dimensional datasets to enhance the robustness and reliability of multi-omics analyses.

1. Introduction

Multi-omics is an approach in which datasets contain multiple layers of data, each of which can be referred to as a different “ome”. Informally, the names of branches of science end in the suffix “-omics” (e.g., genomics in biology), which is related to the suffix “-ome” used to denote the object of study of such a field (e.g., the genome). Combining different “omes” (different layers of data) creates complex data that can be integrated and then analyzed to find novel associations. We can understand systems better through various omics layers that reveal supplementary sources of variability [1].
Transition to multi-omics data has opened new directions of research concerning integrated system-level approaches [2]. Multi-omics data allow the derivation of insights from highly interrelated data and facilitate multi-scale depiction of systems, including biological ones [1]. We can see such applications, among others, in the following domains: disease subtyping, biomarker prediction, treatment optimization, disease prognosis, food safety/microbial risk assessment, toxicology [3], and many more. Even culture is affected, e.g., in terms of quantitative analysis and exploration of cultural trends, which is based on news records, digitized books, social media contributions, etc. [4,5,6].
We also observe an increasing number of multi-omics data repositories from the following initiatives: The Cancer Genome Atlas repository, a joint effort between the National Cancer Institute and the National Human Genome Research Institute, which generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data [7,8]; and the International Cancer Genomics Consortium repository, a global initiative to build a comprehensive catalog of mutational abnormalities in the major tumor types [9,10]. Due to the dispersion of dataset publishing spaces, indexing services have been developed, e.g., the Omics Discovery Index, that participate in the discovery and linking of public omics datasets [11,12]. Security policies have gained importance as well [13].
Multi-omics data have given new impetus to biomedical research, especially by allowing systematic modeling and a comprehensive understanding of the mechanisms behind disease progression and associated risk management [14,15,16]. Naturally, we cannot imagine these applications without machine-learning (ML) and artificial-intelligence (AI) support, as deriving conclusions from these data is one of the most demanding tasks [17,18]. Recently, deep learning (DL) has attracted considerable interest due to its performance and its capability of capturing nonlinear and hierarchical features [14,19,20,21].
In the field of machine learning, classification tasks, data mining, and exploration often require effective feature extraction and selection to improve model performance. Feature selection becomes particularly important in multi-omics data analysis, where the number of features (e.g., genes, proteins, metabolites) far exceeds the number of observations (samples or instances). This imbalance between the dimensionality of the data and the available observations necessitates a careful selection of features, as using all available features can lead to overfitting and reduce the model’s ability to generalize.
In classification tasks, especially when dealing with datasets where the number of features exceeds the number of observations, it is not feasible for the model to make decisions based on all features; instead, it relies on a subset of features that are selected during the learning process. This subset is selected to maximize predictive power while minimizing complexity and overfitting, representing a trade-off between model accuracy and the risk of overfitting. By selecting a smaller set of relevant features, the model is better able to generalize to unseen data, but this comes at the cost of potentially losing some predictive power if too many features are excluded.
The process of selecting features as part of the model training is known as embedded feature selection and is often referred to as an “intrinsic” or “embedded” method [22]. Among the embedded methods, feature selection is most commonly performed by techniques based on the L1 norm (such as Lasso regression) and decision tree-based methods (such as Random Forest), both of which incorporate feature selection into the model-fitting process [23]. In contrast, feature selection can also be done as a separate step (prior to model training), but such an approach still often requires building a classifier on the selected features to evaluate its performance, ultimately linking the quality of feature selection with the classifier’s performance.
The importance of feature selection in multi-omics data has been well recognized, and several studies have benchmarked various feature-selection methods. For instance, Li et al. [24] conducted a large-scale benchmark experiment comparing different feature-selection methods, providing valuable insights into the effectiveness of various approaches. The benchmark was based on 15 cancer datasets from TCGA [8]. According to the authors, no such benchmark had been performed before their work. After a literature inquiry, including [25,26], they chose the following eight feature-selection methods (the most frequently used in the context of cancer classification): T-test [27], Information gain (infor) [28], ReliefF [29], Minimum Redundancy Maximum Relevance (mRMR) [30], Recursive Feature Elimination (RFE) [31], Genetic Algorithm (GA) [32], Least Absolute Shrinkage and Selection Operator (Lasso) [33], and permutation importance of Random Forest (RF-VI) [34]. Predictive performance was assessed using the classification accuracy, the AUC, and the Brier score. Friedman’s test was applied to test for differences between the dataset-specific performance measure values obtained with different methods.
Li et al. [24] concluded that feature selection is an especially important topic, as the proper selection of data features may improve prediction accuracy and reduce the complexity of machine-learning models. Based on the benchmark results, the following main statements were made: (1) RF-VI (method of embedded type [34]) and mRMR (method of filter type [30]) can be used, where it is sufficient to use only small numbers of best features; (2) Lasso (method of embedded type [33]) provides comparable or slightly better predictive performance when compared to the previously mentioned methods, although models have a noticeably bigger number of features; (3) the mRMR method is computationally expensive.
Unfortunately, the authors focused on predictive accuracy and omitted another important aspect, the stability of feature selection, which leaves a gap in the research. Specifically, it remains unclear how consistent and reliable the selected features are across various classification models and datasets. In short, stable feature selection matters because it improves the reproducibility of findings derived from multi-omics analyses and their resilience to minor changes in the training dataset.
The stability of feature selection is an emerging topic. Khaire and Dhanalakshmi [35] provided a review of works concerning feature-selection techniques, keeping in mind different sources of instability and the high dimensionality of data. They noted that a high correlation of features may frequently produce multiple equally optimal signatures, reducing the confidence in the selected features. An up-to-date analysis and comparison of feature-selection methods were provided by Barbieri et al. [36]. They focused on comparing different stability metrics, claiming their work to be an extensive and complete review of feature-selection algorithms and their evaluation. They mention challenges such as high-dimensional data, small samples, noisy and redundant features, and biased data. Metrics for selection accuracy, selection redundancy, prediction performance, algorithmic stability, selection reliability, and computational time were employed. Furthermore, Lazebnik and Rosenfeld [37] showed the shortcomings of existing stability metrics, which are unable to properly address data drift or non-uniformly distributed missing values; a new solution inspired by Lyapunov stability in dynamic systems was proposed. Finally, tightly targeted works also exist, such as COVID-19-oriented ones [38,39].
As we can see, much work has been devoted either to feature-selection algorithms for medical data or to the general stability of feature selection. Nevertheless, to the authors’ best knowledge, no one has tried to address the stability of feature selection in the case of cancer multi-omics TCGA datasets. This work aims to enrich the original study by Li et al. [24] by extending their findings to stability aspects, which seems to be valuable for the machine-learning community.
To summarize, in this paper:
  • we emphasize the importance of testing feature-selection stability when assessing the quality of a classification model applied to multi-omic data;
  • we empirically study the accuracy and feature-selection stability in high-dimensional multi-omics data and analyze 15 datasets and 3 classification methods with embedded feature selection;
  • we propose to combine accuracy with feature-selection stability for classifier evaluation.
The rest of the paper is organized as follows: in Section 2, we describe the classifiers used, feature-stability measures, in particular the measure defined by Nogueira, the datasets used, and the experimental setup, and offer a few words on the evaluation of classification models. In Section 3, we present the results of the experiments. In Section 4, we discuss the outcome we achieved, particularly within the context of the most recent literature. Finally, we conclude in Section 5.

2. Materials and Methods

2.1. Classifiers with Embedded Feature Selection

In this study, we employed classifiers with embedded feature-selection mechanisms based on the L1 norm. The L1 norm, or Lasso regularization, is particularly effective in high-dimensional settings where the number of features exceeds the number of observations, as it encourages sparsity in the feature space. The classifiers used in our analysis include Support Vector Machine (SVM), Logistic Regression (LR), and Lasso regression, each of which integrates feature selection directly within the model-training process.
Let the dataset consist of m observations $O_j$ ($j = 1, \ldots, m$). Each observation $O_j$ is described by n features and represented as an n-dimensional feature vector $\mathbf{x}_j = [x_{j1}, \ldots, x_{jn}]^T$. The components $x_{ji}$ of the vectors $\mathbf{x}_j$ are referred to as features. Each feature vector $\mathbf{x}_j$ is associated with a binary label (or class) $y_j$. The positive class, indicating the presence of a disease, is typically denoted by 1, while the negative class can be represented as either 0 (as used in logistic and linear regression) or $-1$ (as is common in Support Vector Machine models). Consequently, we can express the general criterion function as follows:
$$\Phi_C(\mathbf{w}, \theta) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{error}\left(y_j, \hat{y}(\mathbf{x}_j, \mathbf{w}, \theta)\right) + \frac{1}{C} \sum_{i=1}^{n} |w_i|, \qquad (1)$$
where $\mathbf{w} = [w_1, \ldots, w_n]^T \in \mathbb{R}^n$ represents the weight vector, and $\theta \in \mathbb{R}$ denotes the threshold of the hyperplane $H(\mathbf{w}, \theta)$. These parameters are determined during the minimization of the criterion function $\Phi_C(\mathbf{w}, \theta)$.
The criterion function in Equation (1) consists of two parts. The first term represents the loss function, which measures the difference between the true values $y_j$ and the predicted values $\hat{y}(\mathbf{x}_j, \mathbf{w}, \theta)$. The second term is the regularization term, which penalizes the values of the model weights $\mathbf{w}$.
While the entire function needs to be minimized during training, the regularization term ensures that the model weights remain small. The hyperparameter C controls the relative importance of the two terms: a smaller value of C gives more weight to regularization (encouraging smaller weights), while a larger value of C emphasizes minimizing the loss function, allowing for higher weights. The value of C must be set prior to optimization and determines the trade-off between model fit and complexity.
Using the absolute values (the L1 norm) of the coefficients $|w_i|$ in the regularization term has the property that, during optimization, the weights not only become smaller but actually tend to become exactly zero, thus performing feature selection. In the case of the L2 norm, where the squared weights $w_i^2$ are used in the regularization term, the weights are driven towards zero but rarely become exactly zero. This property of L1 regularization makes it useful for high-dimensional datasets.
A lower value of C reduces the weight assigned to model errors while increasing the strength of regularization. This results in more weights being driven to zero, leading to a sparser model with fewer features. Conversely, a higher value of C increases model complexity by allowing the inclusion of more features, enabling the model to fit the training data more closely. This close fit is generally undesirable in the context of multi-omics data, where the number of features n significantly exceeds the number of observations m.
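To illustrate this behavior, the short sketch below (our illustrative example, not code from the study) fits an L1-penalized logistic regression on synthetic placeholder data for a few values of C and reports how many weights remain non-zero; smaller C means stronger regularization and therefore fewer selected features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic high-dimensional data: far more features than observations,
# mimicking the n >> m situation typical of omics layers.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))              # m = 100 observations, n = 5000 features
y = (X[:, :10].sum(axis=1) > 0).astype(int)   # labels driven by the first 10 features

# Smaller C -> stronger L1 penalty -> more weights driven exactly to zero.
for C in [0.01, 0.1, 1.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    clf.fit(X, y)
    n_selected = np.count_nonzero(clf.coef_)
    print(f"C = {C:>5}: {n_selected} features with non-zero weights")
```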

2.1.1. Support Vector Machines (SVM)

Support Vector Machines [40] are powerful supervised-learning models used for classification tasks. By incorporating L1 regularization, SVMs can effectively perform feature selection while constructing a hyperplane that maximizes the margin between classes. In our experiments, we employed the linear kernel. An observation $\mathbf{x}_i$ is classified as belonging to the positive class $y = 1$ if it is located on the positive side of the hyperplane $H(\mathbf{w}, \theta)$; otherwise, it is assigned to the negative class.
$$\hat{y}_i = \mathrm{sgn}(\mathbf{w}^T \mathbf{x}_i + \theta)$$
Assuming $y_i \in \{-1, 1\}$, the error function can be expressed as the hinge loss [41]:
$$\mathrm{error}_i = \max\left(0,\; 1 - y_i(\mathbf{w}^T \mathbf{x}_i + \theta)\right)$$
The error, or loss, equals 0 if the observation $\mathbf{x}_i$ lies on the correct side of the hyperplane, with a margin of at least 1, according to its true label $y_i$.

2.1.2. Logistic Regression (LR)

Logistic Regression [42] is a widely used classifier that models the probability of a binary outcome based on one or more predictor variables. By applying L1 regularization, logistic regression can effectively perform feature selection by shrinking less informative coefficients to zero, thereby eliminating them from the model. The model prediction $\hat{y}_i$ is given by a logistic function, with values in the range (0, 1) interpreted as the probability that $\mathbf{x}_i$ belongs to the positive class $y_i = 1$, while $1 - \hat{y}_i$ represents the probability of belonging to the negative class. The negative class, in the case of LR, is usually labeled as 0.
$$\hat{y}_i = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x}_i + \theta)}}$$
$$\mathrm{error}_i = \begin{cases} -\log(\hat{y}_i) & \text{if } y_i = 1 \\ -\log(1 - \hat{y}_i) & \text{otherwise} \end{cases}$$

2.1.3. Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression [43] is a linear regression method that incorporates L1 regularization to prevent overfitting and enhance feature selection.
$$\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + \theta$$
$$\mathrm{error}_i = (y_i - \hat{y}_i)^2$$
In our binary classification problem, where $y_i \in \{0, 1\}$, we use a threshold of 0.5 to assign the class $y_i = 1$ if $\hat{y}_i > 0.5$ and the class $y_i = 0$ otherwise.
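As a reference point, the following sketch shows one possible way to instantiate the three L1-regularized models described above with scikit-learn and to read off the selected features (non-zero weights). Note that scikit-learn’s L1-penalized LinearSVC uses the squared hinge loss rather than the plain hinge loss above, and the hyperparameter values here are placeholders rather than those used in our experiments.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, Lasso

def selected_features(model):
    """Indices of features with non-zero weights in a fitted linear model."""
    return np.flatnonzero(np.ravel(model.coef_))

C = 0.1  # placeholder regularization value

# L1-regularized linear SVM; scikit-learn requires the squared hinge with an L1 penalty.
svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C, max_iter=10000)

# L1-regularized logistic regression (log loss), labels in {0, 1}.
lr = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)

# Lasso regression (squared loss); alpha controls the L1 penalty strength
# (larger alpha means stronger regularization, roughly analogous to 1/C).
lasso = Lasso(alpha=1.0 / C, max_iter=10000)

# Example usage with placeholder arrays X_train, y_train, X_test:
# svm.fit(X_train, y_train); print(selected_features(svm))
# For Lasso, class labels are obtained by thresholding predictions at 0.5:
# y_pred = (lasso.fit(X_train, y_train).predict(X_test) > 0.5).astype(int)
```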

2.2. Feature-Selection Stability Measure

One of the first measures used to assess stability was based on the intersection of feature sets, such as the Jaccard Index [44]. These measures allow for the evaluation of similarity between feature sets with different cardinalities, but they do not account for the total number of features and, therefore, lack a correction for chance. This correction was first introduced by Kuncheva [45], but it applied only to sets with the same cardinality. Lustgarten later extended this approach [46] to accommodate sets with different cardinalities. More recently, Nogueira and Brown [47] proposed a new feature-selection stability measure based on the frequency of selected features, which is computed from the selection matrix Z:
$$Z = \begin{bmatrix} z_{11} & z_{12} & z_{13} & \cdots & z_{1n} \\ z_{21} & z_{22} & z_{23} & \cdots & z_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ z_{k1} & z_{k2} & z_{k3} & \cdots & z_{kn} \end{bmatrix}$$
Each row of Z corresponds to one feature-selection subset, while the columns represent individual features. The selection frequency of a feature f is calculated as
$$p_f = \frac{1}{k} \sum_{i=1}^{k} z_{if},$$
and the unbiased sample variance of the selection of the f-th feature is given by:
$$s_f^2 = \frac{k}{k-1}\, p_f (1 - p_f).$$
The average number of selected features over the k feature sets, denoted by $\bar{n}$, is:
$$\bar{n} = \frac{1}{k} \sum_{i=1}^{k} \sum_{f=1}^{n} z_{if}.$$
Additionally, we define the normalized average number of selected features $p_n$ as:
$$p_n = \frac{\bar{n}}{n}.$$
Finally, the Nogueira feature-stability measure $\phi(Z)$ is defined as follows:
$$\phi(Z) = 1 - \frac{\frac{1}{n} \sum_{f=1}^{n} s_f^2}{p_n (1 - p_n)} \qquad (9)$$
As demonstrated by Nogueira and Brown [47], this measure satisfies all five desired properties listed below, while other measures fail to meet some of them. For this reason, we chose to use this measure in our research (a minimal implementation sketch is given after the list).
  • Fully defined. The measure should also be defined for collections of feature subsets with different cardinalities.
  • Correction for chance. Subsets of features can share some of the features even if selected randomly.
  • Bounds. The stability $\phi(Z)$ (9) should be upper/lower bounded by constants not dependent on the overall number of features or the number of features selected.
  • Maximum stability. A measure should achieve its maximum if and only if all feature sets in Z are identical.
  • Strict monotonicity. The stability estimator $\phi(Z)$ (9) should be a strictly decreasing function of the sample variances $s_f^2$ of the variables $Z_f$ (columns of the matrix Z).
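For completeness, below is a compact implementation sketch of the Nogueira measure, written by us directly from the formulas above; Z is the binary k × n selection matrix, and the example matrix at the end is purely illustrative.

```python
import numpy as np

def nogueira_stability(Z):
    """Nogueira feature-selection stability for a binary selection matrix Z.

    Z has shape (k, n): k feature subsets (rows) over n features (columns),
    with Z[i, f] = 1 if feature f was selected in run i and 0 otherwise.
    """
    Z = np.asarray(Z, dtype=float)
    k, n = Z.shape
    p_f = Z.mean(axis=0)                      # selection frequency of each feature
    s2_f = k / (k - 1) * p_f * (1.0 - p_f)    # unbiased sample variance per feature
    n_bar = Z.sum(axis=1).mean()              # average number of selected features
    p_n = n_bar / n                           # normalized average subset size
    return 1.0 - s2_f.mean() / (p_n * (1.0 - p_n))

# Illustrative example: three runs selecting overlapping subsets of 6 features.
Z_example = np.array([[1, 1, 0, 0, 1, 0],
                      [1, 1, 0, 0, 0, 1],
                      [1, 0, 1, 0, 1, 0]])
print(nogueira_stability(Z_example))
```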

2.3. Datasets

In our study, we utilized the same datasets as those described in the benchmark study of feature selection by Li et al. [24], as we aimed to enhance their results by measuring feature-selection stability. Specifically, we selected the same 15 cancer datasets from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov, accessed on 30 October 2024) [8]. The selection criteria stipulated that datasets must contain at least 100 samples, have no missing observations, include the outcome variable for TP53 mutations, and not have more than 90% of observations in a single class. From these datasets, we included four omics layers: cnv (CNV, copy number variation), mirna (miRNA, micro ribonucleic acid), mutation, and rna (RNA, ribonucleic acid). The methylation block was excluded due to its substantial size, which would have resulted in excessive computational times.
These datasets were also previously studied by Herrmann et al. [48], who selected 18 datasets and considered the outcome variable to be the (censored) survival time. Various prediction methods from both machine-learning and statistical approaches demonstrated the limited utility of multi-omics data for this task.
An overview of the 15 included datasets is provided in Table 1. Following Li et al. [24], we predict the presence of TP53 mutations, which have been associated with poor clinical outcomes in cancer patients [49].

2.4. Experimental Setup

For each dataset, we conducted experiments separately on individual omics features (cnv, mirna, mutation, and rna) as well as on the combined feature set. Each experiment was conducted using 5-fold cross-validation, repeated 3 times with different random seeds. For each data subset (omics layers or the combined feature set) and each classifier type (LR, SVM, Lasso), we selected the regularization parameter C prior to the cross-validation loops. This parameter was adjusted to obtain feature subsets of low, medium, and high cardinality, corresponding to approximately 7%, 30%, and 55% of the total number of observations in the datasets, respectively. Details about the algorithm for choosing the C value are described in the next paragraph. Using classifiers with fixed values for the regularization parameters C, we fitted classification models to the training data and extracted the selected features from the fitted models (those with weights $w_i \neq 0$). The selected features were saved for each split, resulting in a total of 15 subsets of features (5 folds × 3 repetitions) for each classifier and each C value across all omics in the dataset. These subsets were then utilized to calculate feature-selection stability. Additionally, for each of these runs, we computed the accuracy on the test set, as shown in Figure 1.
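A minimal sketch of this evaluation loop is given below. It assumes numeric arrays X and y, a make_classifier factory that builds one of the L1 models with its pre-selected C value, and the nogueira_stability helper shown earlier; the variable and function names are ours. For Lasso, the continuous predictions would additionally be thresholded at 0.5 before computing accuracy.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def evaluate(make_classifier, X, y, n_splits=5, n_repeats=3, seed=0):
    """Collect selected-feature masks and balanced accuracies over repeated CV splits."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    masks, accuracies = [], []
    for train_idx, test_idx in cv.split(X, y):
        clf = make_classifier()                 # e.g. LogisticRegression(penalty="l1", C=C_opt, ...)
        clf.fit(X[train_idx], y[train_idx])
        masks.append(np.ravel(clf.coef_) != 0)  # selected features: non-zero weights
        y_pred = clf.predict(X[test_idx])
        accuracies.append(balanced_accuracy_score(y[test_idx], y_pred))
    Z = np.vstack(masks).astype(int)            # k x n selection matrix for nogueira_stability(Z)
    return Z, float(np.mean(accuracies))
```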
The experiments were conducted in the Python 3.9 programming environment. In most cases, including the procedures for generating training and test dataset splits as well as the classification methods, implementations from the scikit-learn [50] library were utilized. Detailed implementation information is provided in the source code available in the GitHub repository (https://github.com/tlukaszuk/feature-selection-stability-on-multi-omics-data, accessed on 26 November 2024).
An important part of our computational procedure was selecting the value of parameter C to ensure that the classifier utilizes the desired number of features. It should be noted, however, that in practice, it is not possible to select C to produce a classifier model that uses exactly the desired number of features. The number of features used by the classifier is influenced by the subset of data on which it is fitted. Nonetheless, it can be expected that classifier models fitted on similarly large subsets of the same dataset, with the same C value, will use a comparable number of features. In practice, we may find that these feature counts do not vary by more than 5%.
Based on the above, the determination of parameter C takes as input the type of classifier (LR, SVM, or Lasso), the desired number of features $n_f$ for the fitted classifier model, and the dataset $D = \{\mathbf{x}_j, y_j\}$ (or its omics layer). The computational procedure employs a gradual narrowing of the range of C values, training successive classifier models and checking the resulting feature subset size. This process continues until a classifier model is obtained that, for the tested parameter $C = C_{opt}$, uses approximately the desired number of features $n_f$ (within a tolerance of 5%). The value $C_{opt}$ is then returned as the determined C value, ensuring that classifier models fitted on similarly large subsets of the same dataset D will use approximately $n_f$ features.
In our computational procedure, the determination of C-parameter values is performed separately for each dataset and for its subsets containing features from a single omics layer. For each dataset, we first define the desired numbers of features $n_f^{low}$, $n_f^{med}$, and $n_f^{high}$ that the classification models should utilize. These feature counts correspond to approximately 7%, 30%, and 55% of 80% of the total number of observations (to match the size of the training split used during 5-fold cross-validation).
The C-parameter selection is carried out independently for each classifier type (SVM, LR, Lasso) and each desired feature count $n_f^{low}$, $n_f^{med}$, $n_f^{high}$. Additionally, a random 80% split of the observation set is used during the C-parameter determination procedure to reflect the training set size employed in the main computational workflow.
As a result, different sets of C-parameter values (low, med, and high) are used for each classifier and each dataset or its omics subsets.
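The listing below sketches one possible form of such a search, narrowing the C range by bisection on a logarithmic scale until the fitted model keeps approximately the target number of features. It is our simplified reconstruction, not the exact procedure from the repository, and it assumes that the number of selected features grows roughly monotonically with C.

```python
import numpy as np

def find_C(make_classifier, X, y, n_target, tol=0.05, c_lo=1e-4, c_hi=1e4, max_iter=50):
    """Search for C such that the fitted L1 model keeps roughly n_target features."""
    for _ in range(max_iter):
        c_mid = np.sqrt(c_lo * c_hi)        # geometric midpoint of the current range
        clf = make_classifier(c_mid)        # e.g. lambda C: LogisticRegression(penalty="l1", C=C, ...)
        clf.fit(X, y)
        n_selected = np.count_nonzero(clf.coef_)
        if abs(n_selected - n_target) <= tol * n_target:
            return c_mid                    # within the 5% tolerance
        if n_selected < n_target:
            c_lo = c_mid                    # too few features: weaken regularization
        else:
            c_hi = c_mid                    # too many features: strengthen regularization
    return c_mid                            # best value found within max_iter steps
```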

2.5. Evaluation of the Classification Models

Classification accuracy is a metric that measures how often a model correctly classifies instances in a dataset. It is calculated as the ratio of correctly predicted instances to the total number of instances. High accuracy indicates that the model’s predictions closely match the actual labels, which is generally desirable.
Classification accuracy is a widely used evaluation metric. However, for unbalanced datasets, where one class has significantly more observations than the other, classification accuracy can be misleading. High accuracy values may not reflect good model quality, as the model might only perform well for the majority class. In our study, several datasets are unbalanced. To address this, we use balanced classification accuracy, which accounts for the number of observations in each class in the evaluated sample. Throughout this paper, any reference to classification accuracy refers specifically to balanced classification accuracy.
In the computational procedure, we used 5-fold cross-validation, where 4/5 of the observations were used to fit the classifier model, and the remaining 1/5 were used for evaluation. The training and evaluation process was repeated multiple times, each with a different split between training and testing, following cross-validation guidelines. From our perspective, when evaluating a classifier, it is more important to consider the classification accuracy on the test subset as an unbiased estimate, reflecting the expected performance on new observations when the models are applied in practice.
The second parameter we present from our experiments is the stability of feature selection, expressed using the Nogueira measure (Equation (9)). This stability was evaluated based on the compositions of feature subsets selected by classifier models trained on 15 dataset splits (using 3 repetitions of 5-fold cross-validation) as shown in Figure 1. A higher stability value indicates greater reproducibility of the feature subset compositions and stronger independence of the classifier model from variations in the training set, which is beneficial in practical applications.
Recognizing that both feature-selection stability and classification accuracy are crucial for assessing the quality of a predictive model, we propose combining these two parameters into a single metric using a weighted harmonic mean F β .
$$F_\beta = \frac{1 + \beta}{\dfrac{1}{acc} + \dfrac{\beta}{fss}},$$
where $acc$ denotes the classification accuracy value, and $fss$ denotes the feature-selection stability value obtained for the tested classifier model. The parameter $\beta$ is a weighting factor that controls the relative contribution of feature-selection stability and classification accuracy to $F_\beta$. A value of $\beta = 1$ indicates an equal contribution from both metrics. Values of $\beta > 1$ increase the weight of feature-selection stability in assessing model quality, while values of $0 < \beta < 1$ place greater emphasis on classification accuracy. The appropriate choice of $\beta$ depends on the specific problem and the relative importance of each evaluation metric.
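Computing $F_\beta$ from the two metric values is straightforward; the small helper below (ours) follows the formula above, with placeholder input values.

```python
def f_beta(acc, fss, beta=1.0):
    """Weighted harmonic mean of classification accuracy and feature-selection stability."""
    return (1.0 + beta) / (1.0 / acc + beta / fss)

# beta = 1 weights both metrics equally; beta = 5 favours stability,
# while beta = 0.2 favours classification accuracy (the settings used in Figure 4).
print(f_beta(acc=0.74, fss=0.48, beta=1.0))
```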

3. Results

This section presents and discusses the results obtained from our experiments. The focus of this presentation will be on the three aspects investigated: classification accuracy, feature-selection stability, and our proposed metric, which combines classification accuracy and feature-selection stability into a single parameter.
Given the large number of datasets used in this study, we will not present the results for each dataset individually. Instead, we will report averaged values and standard deviations for each type and kind of classifier studied, with a breakdown by omics layer. This approach will allow for more general conclusions to be drawn.
The classification accuracy values for the examined methods across different omics layers are detailed in Table 2 and Figure 2. The results are categorized by three ranges of the regularization parameter C (low, med, high).
It can be observed that for virtually every omics layer, the highest classification accuracy is achieved with classifiers of med-complexity. This suggests that moderate model complexity, defined by the number of features used, is optimal for achieving accuracy. Increasing or decreasing the number of features leads to a decline in classification accuracy.
Regarding the classification accuracy values, the highest results were seen with the __ALL__ (0.721–0.745) and rna (0.723–0.744) layers, while the lowest results were observed with the mutation layer (0.605–0.617).
Among the classifiers evaluated, the SVM model exhibited the highest overall accuracy in the med range of C, with an accuracy of 0.743, while the weakest performance was observed for the mutation layer in the high range, with an accuracy of 0.605. The LR model exhibited slightly lower accuracy compared to SVM; however, similar to SVM, its best performances (0.739 and 0.738) were achieved in the med range. The Lasso regression achieved peak accuracies of 0.745 in the med range for the combined feature set and 0.743 for the rna omics layer, highlighting its effectiveness in this configuration. It is also worth noting that the differences in classification accuracy across classifier types (SVM, LR, Lasso) in corresponding configurations are minimal.
The stability of feature selection values, as measured by the Nogueira metric, is summarized in Table 3 and Figure 3 for the examined classification methods across different ranges of the regularization parameter C (low, med, high).
Among the classifiers evaluated, the SVM demonstrated varying stability, with the highest stability for the mirna layer at the low C range, achieving a value of 0.597. However, stability decreased significantly in the high C range, with an overall stability score of only 0.353, indicating reduced reliability in feature selection. The LR exhibited consistent performance, particularly at the low C range, where the stability for the mirna layer reached 0.590. Stability values for LR were generally lower in the med and high ranges, reflecting a consistent but less robust feature selection across the parameter spectrum. The Lasso regression also showed strong stability at the low C range, with the highest value of 0.589 for the mirna layer. However, similar to the other methods, stability diminished in the higher parameter ranges, especially noted with a score of 0.366 for the rna layer in the high range. Overall, the low C range consistently yielded better feature-selection stability across all classifiers, highlighting the trade-off between model complexity and stability in multi-omics data analysis.
The weighted harmonic mean F β of classification accuracy and feature-selection stability values, achieved for the examined classification methods across different ranges of the regularization parameter C (low, med, high), is summarized in Table 4. This measure provides a balanced assessment of both performance metrics, with equal weighting assigned ( β = 1 ).
Among the classifiers, the SVM achieved the highest overall weighted harmonic mean in the low C range, with a value of 0.578 for the combined feature set and 0.637 for the mirna layer. The LR performed almost as well, reaching 0.575 overall in the low C range, with its best value of 0.632 recorded for the mirna layer, which highlights the robustness of LR in this context. For both classifiers, performance decreased in the med and high ranges, suggesting that increasing model complexity adversely affects the balance between accuracy and stability. The Lasso regression exhibited similar trends, with the best overall performance in the low C range at 0.570 and a notable value of 0.632 for the mirna layer. As with the other methods, a decline was observed as C increased, particularly in the high C range, where Lasso’s value for the combined feature set dropped to 0.498. Overall, the results indicate that lower values of C tend to yield a better balance between classification accuracy and feature-selection stability across all classifiers, emphasizing the importance of model tuning in multi-omics analyses.
Figure 4 presents a summary of the F β values obtained for the three different β -weight values. In each case, the less-complex models (low, using fewer features) exhibit higher F β values, even when the contribution of classification accuracy is weighted five times higher than feature-selection stability.

4. Discussion

The aim of our work was not to precisely replicate the experiments presented in Li et al. [24] but to extend the evaluation metrics they used, focused on classification accuracy and related metrics, to include feature selection stability, which is increasingly recognized as an important evaluation parameter [22,36,51]. In our analysis, we used different feature-selection and classification methods than Li et al. [24] but applied them to the same datasets. We achieved similar classification accuracy values. Additionally, we demonstrated that when considering both classification accuracy and feature-selection stability, a model with different parameters can be identified as the best. Classification accuracy (or its derivative) remains a crucial parameter for evaluating the quality of a classification model, but incorporating feature-selection stability provides a more comprehensive assessment.
In our analysis, the classification accuracy across individual omic layers and the combined feature set did not exhibit significant variations. This stability in accuracy may be attributed to the inherent biological nature of cancer, which manifests across multiple omics layers. The consistent representation of cancer-related pathways and processes in genomic, transcriptomic, proteomic, and metabolomic data suggests that key signals for classification are preserved across these diverse data types. This highlights the interconnectedness of biological systems in cancer pathology and reinforces the notion that integrating multi-omics data can offer a more comprehensive understanding of the disease.
Furthermore, the robust performance across different omics layers underscores the potential of comprehensive multi-omics approaches for effective biomarker discovery. It also suggests that cancer characteristics are sufficiently prominent across all omics layers, resulting in similar predictive power. However, this consistency warrants further investigation into whether certain omics layers contribute disproportionately to classification outcomes. Such insights could inform future studies on optimal feature selection and model training in multi-omics research.
Our future work could focus on extending the validation of the presented method to high-dimensional datasets beyond the medical domain, such as those found in fields like finance, social sciences, or environmental studies. This would help to evaluate the generalizability and robustness of the method in various contexts. Additionally, developing a comprehensive framework for the automated assessment of feature-selection stability in multi-omics data is another promising direction. This framework could integrate multiple stability metrics, incorporate machine-learning algorithms tailored to different omics types, and provide visualization tools to aid in interpreting stability outcomes. Such advancements would enhance the usability and effectiveness of stability assessments in complex, high-dimensional datasets, ultimately contributing to more reliable and reproducible feature-selection processes across diverse scientific disciplines.
In concluding the discussion, it is essential to acknowledge the limitations of the presented research. We take into consideration only four omic layers, which may not capture the full complexity of biological systems. In reality, there are additional omic layers, such as methylation, which can provide valuable insights. By excluding these layers, the analysis may overlook critical interactions and regulatory mechanisms that could influence the results. Future studies should aim to include a broader range of omic layers to provide a more holistic understanding and to better reflect the intricacies of multi-omics datasets.

5. Conclusions

In conclusion, this study underscores the importance of evaluating both classification accuracy and feature-selection stability in multi-omics data analysis. By employing a range of classifiers, including Support Vector Machine, Logistic Regression, and Lasso regression, we demonstrated that the choice of regularization parameter C significantly impacts both performance metrics. Our findings indicate that lower values of C consistently yield a better balance between classification accuracy and feature selection stability, particularly for Logistic Regression and Lasso regression.
The integration of the Nogueira feature-stability measure and the weighted harmonic mean of accuracy and stability offers a comprehensive approach to assess the reliability of feature selection in multi-omics contexts. The insights gained from our analysis highlight the necessity for careful model tuning and validation to improve the robustness of biomarker discovery processes in cancer research.
Overall, this work not only contributes to the understanding of feature-selection stability but also sets the stage for future investigations aimed at enhancing predictive modeling in multi-omics studies. By prioritizing stability alongside accuracy, researchers can develop more reliable models that facilitate advancements in precision medicine and the effective treatment of complex diseases.

Author Contributions

Conceptualization, T.Ł. and J.K. (Jerzy Krawczuk); methodology, T.Ł. and J.K. (Jerzy Krawczuk); software, T.Ł. and J.K. (Jerzy Krawczuk); validation, T.Ł., J.K. (Jerzy Krawczuk), K.Ż. and J.K. (Jacek Kęsik); formal analysis, T.Ł. and J.K. (Jerzy Krawczuk); investigation, T.Ł. and J.K. (Jerzy Krawczuk); resources, T.Ł. and J.K. (Jerzy Krawczuk); data curation, T.Ł. and J.K. (Jerzy Krawczuk); writing—original draft preparation, T.Ł., J.K. (Jerzy Krawczuk), K.Ż. and J.K. (Jacek Kęsik); writing—review and editing, T.Ł., J.K. (Jerzy Krawczuk), K.Ż. and J.K. (Jacek Kęsik); visualization, T.Ł. and J.K. (Jerzy Krawczuk); supervision, T.Ł. and J.K. (Jacek Kęsik). All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the Lublin University of Technology Scientific Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and codes to reproduce the results in the paper are available publicly on GitHub: https://github.com/tlukaszuk/feature-selection-stability-on-multi-omics-data.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LASSO or Lasso    Least Absolute Shrinkage and Selection Operator
LR                Logistic Regression
SVM               Support Vector Machines
TCGA              The Cancer Genome Atlas

References

  1. Shahrajabian, M.H.; Sun, W. Survey on multi-omics, and multi-omics data analysis, integration and application. Curr. Pharm. Anal. 2023, 19, 267–281. [Google Scholar] [CrossRef]
  2. Subramanian, I.; Verma, S.; Kumar, S.; Jere, A.; Anamika, K. Multi-omics data integration, interpretation, and its application. Bioinform. Biol. Insights 2020, 14, 1177932219899051. [Google Scholar] [CrossRef] [PubMed]
  3. Canzler, S.; Schor, J.; Busch, W.; Schubert, K.; Rolle-Kampczyk, U.E.; Seitz, H.; Kamp, H.; von Bergen, M.; Buesen, R.; Hackermüller, J. Prospects and challenges of multi-omics data integration in toxicology. Arch. Toxicol. 2020, 94, 371–388. [Google Scholar] [CrossRef] [PubMed]
  4. Michel, J.B.; Shen, Y.K.; Aiden, A.P.; Veres, A.; Gray, M.K.; Team, G.B.; Pickett, J.P.; Hoiberg, D.; Clancy, D.; Norvig, P.; et al. Quantitative analysis of culture using millions of digitized books. Science 2011, 331, 176–182. [Google Scholar] [CrossRef]
  5. Albuquerque, U.P.; Cantalice, A.S.; Oliveira, E.S.; de Moura, J.M.B.; Dos Santos, R.K.S.; da Silva, R.H.; Brito-Júnior, V.M.; Ferreira-Júnior, W.S. Exploring large digital bodies for the study of human behavior. Evol. Psychol. Sci. 2023, 9, 385–394. [Google Scholar] [CrossRef]
  6. Kęsik, J.; Żyła, K.; Montusiewicz, J.; Miłosz, M.; Neamtu, C.; Juszczyk, M. A methodical approach to 3d scanning of heritage objects being under continuous display. Appl. Sci. 2022, 13, 441. [Google Scholar] [CrossRef]
  7. Wang, Z.; Jensen, M.A.; Zenklusen, J.C. A practical guide to the cancer genome atlas (TCGA). In Statistical Genomics: Methods and Protocols; Humana Press: New York, NY, USA, 2016; pp. 111–141. [Google Scholar]
  8. NIH, TCGA Program. Available online: https://www.cancer.gov/ccg/research/genome-sequencing/tcga (accessed on 30 October 2024).
  9. Zhang, J.; Bajari, R.; Andric, D.; Gerthoffert, F.; Lepsa, A.; Nahal-Bose, H.; Stein, L.D.; Ferretti, V. The international cancer genome consortium data portal. Nat. Biotechnol. 2019, 37, 367–369. [Google Scholar] [CrossRef]
  10. ICGC Data Repository. Available online: https://docs.icgc-argo.org/docs/data-access/icgc-25k-data (accessed on 30 October 2024).
  11. Perez-Riverol, Y.; Bai, M.; da Veiga Leprevost, F.; Squizzato, S.; Park, Y.M.; Haug, K.; Carroll, A.J.; Spalding, D.; Paschall, J.; Wang, M.; et al. Discovering and linking public omics data sets using the Omics Discovery Index. Nat. Biotechnol. 2017, 35, 406–409. [Google Scholar] [CrossRef]
  12. Omics DI Homepage. Available online: https://www.omicsdi.org/ (accessed on 30 October 2024).
  13. Kozieł, G. Information security policy creating. Actual Probl. Econ. 2011, 126, 376–380. [Google Scholar]
  14. Kang, M.; Ko, E.; Mersha, T.B. A roadmap for multi-omics data integration using deep learning. Briefings Bioinform. 2022, 23, bbab454. [Google Scholar] [CrossRef]
  15. He, X.; Liu, X.; Zuo, F.; Shi, H.; Jing, J. Artificial intelligence-based multi-omics analysis fuels cancer precision medicine. Semin. Cancer Biol. 2023, 88, 187–200. [Google Scholar] [CrossRef]
  16. Chakraborty, S.; Sharma, G.; Karmakar, S.; Banerjee, S. Multi-OMICS approaches in cancer biology: New era in cancer therapy. Biochim. Biophys. Acta (BBA)-Mol. Basis Dis. 2024, 1870, 167120. [Google Scholar] [CrossRef] [PubMed]
  17. Alkhateeb, A.; Rueda, L. Machine Learning Methods for Multi-Omics Data Integration; Springer: Berlin/Heidelberg, Germany, 2023. [Google Scholar]
  18. Ahmed, Z.; Wan, S.; Zhang, F.; Zhong, W. Artificial intelligence for omics data analysis. BMC Methods 2024, 1, 4. [Google Scholar] [CrossRef]
  19. Wekesa, J.S.; Kimwele, M. A review of multi-omics data integration through deep learning approaches for disease diagnosis, prognosis, and treatment. Front. Genet. 2023, 14, 1199087. [Google Scholar] [CrossRef] [PubMed]
  20. Leng, D.; Zheng, L.; Wen, Y.; Zhang, Y.; Wu, L.; Wang, J.; Wang, M.; Zhang, Z.; He, S.; Bo, X. A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol. 2022, 23, 171. [Google Scholar] [CrossRef]
  21. Labory, J.; Bottini, S. The multiomics revolution in the era of deep learning: Allies or enemies. In Artificial Intelligence for Medicine; Elsevier: Amsterdam, The Netherlands, 2024; pp. 201–216. [Google Scholar]
  22. Pes, B. Ensemble feature selection for high-dimensional data: A stability analysis across multiple domains. Neural Comput. Appl. 2020, 32, 5951–5973. [Google Scholar] [CrossRef]
  23. Łukaszuk, T.; Krawczuk, J. Importance of feature selection stability in the classifier evaluation on high-dimensional genetic data. PeerJ 2024, 12, e18405. [Google Scholar] [CrossRef]
  24. Li, Y.; Mansmann, U.; Du, S.; Hornung, R. Benchmark study of feature selection strategies for multi-omics data. BMC Bioinform. 2022, 23, 412. [Google Scholar] [CrossRef]
  25. Momeni, Z.; Hassanzadeh, E.; Abadeh, M.S.; Bellazzi, R. A survey on single and multi omics data mining methods in cancer data classification. J. Biomed. Inform. 2020, 107, 103466. [Google Scholar] [CrossRef]
  26. Al-Tashi, Q.; Abdulkadir, S.J.; Rais, H.M.; Mirjalili, S.; Alhussian, H. Approaches to multi-objective feature selection: A systematic literature review. IEEE Access 2020, 8, 125076–125096. [Google Scholar] [CrossRef]
  27. Peck, R.; Devore, J. Statistics: The exploration and analysis of data. Cengage Learn. 2011, 464465, 516–519. [Google Scholar]
  28. Gao, L.; Ye, M.; Lu, X.; Huang, D. Hybrid method based on information gain and support vector machine for gene selection in cancer classification. Genom. Proteom. Bioinform. 2017, 15, 389–395. [Google Scholar] [CrossRef] [PubMed]
  29. Kononenko, I.; Šimec, E.; Robnik-Šikonja, M. Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 1997, 7, 39–55. [Google Scholar] [CrossRef]
  30. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  31. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422. [Google Scholar] [CrossRef]
  32. Oreski, S.; Oreski, G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Syst. Appl. 2014, 41, 2052–2064. [Google Scholar] [CrossRef]
  33. Tibshirani, R. The lasso method for variable selection in the Cox model. Stat. Med. 1997, 16, 385–395. [Google Scholar] [CrossRef]
  34. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  35. Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1060–1073. [Google Scholar] [CrossRef]
  36. Barbieri, M.C.; Grisci, B.I.; Dorn, M. Analysis and comparison of feature selection methods towards performance and stability. Expert Syst. Appl. 2024, 249, 123667. [Google Scholar] [CrossRef]
  37. Lazebnik, T.; Rosenfeld, A. A new definition for feature selection stability analysis. Ann. Math. Artif. Intell. 2024, 92, 753–770. [Google Scholar] [CrossRef]
  38. Mohtasham, F.; Pourhoseingholi, M.; Hashemi Nazari, S.S.; Kavousi, K.; Zali, M.R. Comparative analysis of feature selection techniques for COVID-19 dataset. Sci. Rep. 2024, 14, 18627. [Google Scholar] [CrossRef]
  39. Hayet-Otero, M.; Garcia-Garcia, F.; Lee, D.J.; Martínez-Minaya, J.; España Yandiola, P.P.; Urrutia Landa, I.; Nieves Ermecheo, M.; Quintana, J.M.; Menéndez, R.; Torres, A.; et al. Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques. PLoS ONE 2023, 18, e0284150. [Google Scholar] [CrossRef] [PubMed]
  40. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  41. Bartlett, P.L.; Wegkamp, M.H. Classification with a Reject Option using a Hinge Loss. J. Mach. Learn. Res. 2008, 9, 1823–1840. [Google Scholar]
  42. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
  43. Ranstam, J.; Cook, J.A. LASSO regression. J. Br. Surg. 2018, 105, 1348. [Google Scholar] [CrossRef]
  44. Saeys, Y.; Abeel, T.; Van de Peer, Y. Robust feature selection using ensemble feature selection techniques. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the European Conference, ECML PKDD 2008, Antwerp, Belgium, 15–19 September 2008; Proceedings, Part II 19; Springer: Berlin/Heidelberg, Germany, 2008; pp. 313–325. [Google Scholar]
  45. Kuncheva, L. A stability index for feature selection. In Proceedings of the Artificial Intelligence and Applications, AIAP’07, Innsbruck, Austria, 12–14 February 2007; ACTA Press: Anaheim, CA, USA, 2007; pp. 390–395. [Google Scholar]
  46. Lustgarten, J.; Gopalakrishnan, V.; Visweswaran, S. Measuring stability of feature selection in biomedical datasets. In Proceedings of the AMIA Annual Symposium Proceedings, San Francisco, CA, USA, 14–18 November 2009; American Medical Informatics Association: Bethesda, MD, USA, 2009; Volume 2009, pp. 406–410. [Google Scholar]
  47. Nogueira, S.; Sechidis, K.; Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018, 18, 1–54. [Google Scholar]
  48. Herrmann, M.; Probst, P.; Hornung, R.; Jurinovic, V.; Boulesteix, A.L. Large-scale benchmark study of survival prediction methods using multi-omics data. Briefings Bioinform. 2021, 22, bbaa167. [Google Scholar] [CrossRef]
  49. Wang, X.; Sun, Q. TP53 mutations, expression and interaction networks in human cancers. Oncotarget 2017, 8, 624. [Google Scholar] [CrossRef]
  50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  51. Kalousis, A.; Prados, J.; Hilario, M. Stability of feature selection algorithms: A study on high-dimensional spaces. Knowl. Inf. Syst. 2007, 12, 95–116. [Google Scholar] [CrossRef]
Figure 1. Experimental setup. For each omics layer and all omics, 5-fold stratified cross-validation was repeated 3 times. Each of 3 classifiers was trained using three different C values ( C l o w , C m e d and C h i g h ). For each classifier, both feature selection stability and accuracy were calculated.
Figure 2. Classification accuracy values on the test part of the dataset achieved for the examined classification methods in the three ranges of C parameter values. The central dots represent the mean values calculated from the results across 15 datasets and 15 splits of each dataset. The vertical lines represent the standard deviation values.
Figure 3. The values of feature-selection stability, according to the Nogueira measure, obtained for the tested classification methods across three ranges of C parameter values. The central dots represent the mean values calculated from the results across 15 datasets. The vertical lines represent the standard deviation values.
Figure 4. Values of the weighted harmonic mean F β of classification accuracy and feature selection stability achieved for the tested classification methods in the three ranges of C parameter values. The central dots represent the mean values calculated from the results across 15 datasets. The vertical lines represent the standard deviation values. In the first graph, classification accuracy and feature-selection stability have equal contributions to the harmonic mean ( β = 1 ). In the second graph, stability is weighted 5 times more than classification accuracy ( β = 5 ). In the third graph, classification accuracy is weighted 5 times more than feature-selection stability ( β = 0.2 ).
Table 1. Summary of the datasets used in the experiments. The “Split” column indicates the number of objects without TP53 mutations and those with the mutation, respectively. The “__ALL__” column represents the total number of available features, while the columns on the right display the number of features for each omics layer.
Name    Objects    Split 0/1    __ALL__    cnv       mirna    mutation    rna
BLCA    382        196/186      100,451    57,964    825      18,576      23,081
BRCA    735        480/255      99,475     57,964    835      17,974      22,694
COAD    191        85/106       99,520     57,964    802      18,537      22,210
ESCA    106        23/83        96,854     57,964    763      12,627      25,494
HNSC    443        136/307      97,535     57,964    793      17,247      21,520
LGG     419        224/195      90,150     57,964    645      9,234       22,297
LIHC    159        115/44       91,565     57,964    776      11,820      20,994
LUAD    426        214/212      100,840    57,964    799      18,387      23,681
LUSC    418        72/346       100,891    57,964    895      18,499      23,524
PAAD    124        46/78        93,325     57,964    612      12,391      22,348
PRAD    407        359/48       91,984     57,925    585      11,701      21,769
SARC    126        78/48        91,595     57,964    778      10,000      22,842
SKCM    249        210/39       99,815     57,964    1,002    18,592      22,248
STAD    295        156/139      103,365    57,964    787      18,580      26,027
UCEC    405        261/144      103,354    57,447    866      21,052      23,978
Table 2. Classification accuracy values (means and standard deviations) on the test part of the dataset achieved for the examined classification methods in the three ranges of C parameter values (low, med, high).
Classifier    __ALL__          cnv              mirna            mutation         rna
SVM_low       0.724 ± 0.061    0.663 ± 0.070    0.686 ± 0.065    0.612 ± 0.063    0.729 ± 0.059
SVM_med       0.743 ± 0.061    0.678 ± 0.064    0.685 ± 0.063    0.616 ± 0.056    0.740 ± 0.062
SVM_high      0.721 ± 0.066    0.661 ± 0.069    0.655 ± 0.061    0.605 ± 0.062    0.723 ± 0.064
LR_low        0.723 ± 0.062    0.661 ± 0.070    0.685 ± 0.066    0.612 ± 0.065    0.728 ± 0.060
LR_med        0.739 ± 0.064    0.677 ± 0.066    0.685 ± 0.063    0.615 ± 0.058    0.738 ± 0.065
LR_high       0.728 ± 0.063    0.668 ± 0.065    0.661 ± 0.060    0.606 ± 0.058    0.728 ± 0.065
Lasso_low     0.727 ± 0.057    0.684 ± 0.070    0.684 ± 0.064    0.608 ± 0.063    0.732 ± 0.056
Lasso_med     0.745 ± 0.062    0.690 ± 0.059    0.696 ± 0.059    0.617 ± 0.058    0.743 ± 0.060
Lasso_high    0.742 ± 0.061    0.681 ± 0.069    0.678 ± 0.064    0.606 ± 0.058    0.744 ± 0.059
The best results are highlighted in green, while the weakest results are highlighted in red.
Table 3. The values of feature-selection stability (means and standard deviations), as measured by the Nogueira metric, obtained for the classification methods across three ranges of C parameter values (low, med, high).
Classifier    __ALL__          cnv              mirna            mutation         rna
SVM_low       0.482 ± 0.086    0.449 ± 0.079    0.597 ± 0.091    0.420 ± 0.093    0.470 ± 0.085
SVM_med       0.396 ± 0.035    0.381 ± 0.039    0.470 ± 0.047    0.398 ± 0.045    0.387 ± 0.035
SVM_high      0.353 ± 0.033    0.364 ± 0.030    0.372 ± 0.046    0.363 ± 0.036    0.345 ± 0.033
LR_low        0.479 ± 0.082    0.453 ± 0.080    0.590 ± 0.087    0.419 ± 0.093    0.464 ± 0.089
LR_med        0.397 ± 0.035    0.385 ± 0.041    0.472 ± 0.045    0.401 ± 0.050    0.390 ± 0.034
LR_high       0.384 ± 0.033    0.372 ± 0.031    0.394 ± 0.041    0.383 ± 0.028    0.376 ± 0.037
Lasso_low     0.470 ± 0.089    0.447 ± 0.080    0.589 ± 0.080    0.424 ± 0.093    0.462 ± 0.091
Lasso_med     0.401 ± 0.044    0.376 ± 0.042    0.488 ± 0.033    0.399 ± 0.050    0.395 ± 0.037
Lasso_high    0.376 ± 0.033    0.368 ± 0.030    0.409 ± 0.041    0.374 ± 0.036    0.366 ± 0.030
The best results are highlighted in green, while the weakest results are highlighted in red.
Table 4. Values (means and standard deviations) of the weighted harmonic mean F β of classification accuracy and feature-selection stability achieved for the examined classification methods in the three ranges of C parameter values (low, med, and high). The values presented in the table were determined for β = 1 , indicating equal weighting of classification accuracy and feature-selection stability.
Classifier    __ALL__          cnv              mirna            mutation         rna
SVM_low       0.578 ± 0.096    0.528 ± 0.062    0.637 ± 0.093    0.495 ± 0.092    0.570 ± 0.094
SVM_med       0.515 ± 0.049    0.485 ± 0.049    0.553 ± 0.043    0.482 ± 0.054    0.507 ± 0.048
SVM_high      0.473 ± 0.044    0.466 ± 0.036    0.471 ± 0.040    0.452 ± 0.044    0.465 ± 0.044
LR_low        0.575 ± 0.093    0.530 ± 0.063    0.632 ± 0.090    0.494 ± 0.093    0.565 ± 0.097
LR_med        0.515 ± 0.050    0.488 ± 0.050    0.555 ± 0.041    0.483 ± 0.059    0.509 ± 0.049
LR_high       0.502 ± 0.051    0.475 ± 0.039    0.490 ± 0.037    0.468 ± 0.043    0.495 ± 0.051
Lasso_low     0.570 ± 0.099    0.535 ± 0.065    0.632 ± 0.091    0.496 ± 0.093    0.565 ± 0.099
Lasso_med     0.520 ± 0.059    0.484 ± 0.048    0.572 ± 0.046    0.482 ± 0.057    0.515 ± 0.052
Lasso_high    0.498 ± 0.046    0.475 ± 0.037    0.507 ± 0.039    0.461 ± 0.046    0.489 ± 0.045
The best results are highlighted in green, while the weakest results are highlighted in red.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
