There are several situations in which feature selection is preferred over feature extraction: when interpretability of the features is required to understand the relationships between variables; when computational efficiency matters, since feature selection allows faster computing times; to reduce overfitting when the available datasets are small; when domain knowledge identifies important features that should be selected first; and, finally, when features are highly correlated, since feature selection can identify the most relevant ones and thereby reduce dimensionality.
Since unsupervised data in the real world may contain irrelevant or redundant features that can affect the analysis and lead to biases or incorrect results [1], it becomes crucial to eliminate unnecessary features to improve computational efficiency and to enhance the robustness of simple models on small datasets, owing to their lower variance with respect to sample details [1,2]. By explaining the data with fewer features, we gain a deeper understanding of the underlying process, enabling the extraction of valuable knowledge [2]. This becomes especially relevant when seeking to understand the behavior of specific systems or phenomena [8].
Different strategies have been proposed to deal with missing data, the two most common being imputation and deletion.
This study proposes a novel UFS method based on an iterative nonlinear estimation of components by partial least squares, in which feature weights are calculated for each component and then clustered using a minibatch approach to reduce computation time. The minibatches are subsets of the weights of each feature with respect to the components; the procedure ultimately delivers clusters of similar features and selects a representative of each cluster. Our feature selection method has a low computational load while preserving most of the original variance, and it works with missing data without the need for imputation or deletion.
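As a rough illustration of this pipeline (an illustrative sketch, not the authors' actual implementation), the following Python code computes NIPALS-style component weights while skipping missing entries and then clusters the per-feature weights with minibatch k-means, keeping the feature closest to each cluster center; the function names, the number of components, and the number of clusters are assumptions made here for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def nipals_weights(X, n_components=3, max_iter=100, tol=1e-6):
    """NIPALS-style iteration that ignores missing entries (NaN) and
    returns an (n_features, n_components) matrix of feature weights."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)
    Xc = np.where(mask, X - np.nanmean(X, axis=0), 0.0)     # center, zero-fill missing
    W = np.zeros((X.shape[1], n_components))
    for k in range(n_components):
        t = Xc[:, np.argmax((Xc ** 2).sum(axis=0))].copy()  # start from the most energetic column
        for _ in range(max_iter):
            # weights: regress each observed column of Xc on the score vector t
            w = (Xc * t[:, None]).sum(axis=0) / np.maximum((mask * t[:, None] ** 2).sum(axis=0), 1e-12)
            w /= np.linalg.norm(w) + 1e-12
            # scores: regress each observed row of Xc on the weight vector w
            t_new = (Xc * w[None, :]).sum(axis=1) / np.maximum((mask * w[None, :] ** 2).sum(axis=1), 1e-12)
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:
                break
        W[:, k] = w
        Xc = Xc - np.outer(t, w) * mask                     # deflate observed entries only
    return W

def select_representatives(X, n_select=5, n_components=3):
    """Cluster the feature weights with minibatch k-means and keep the
    feature closest to each cluster center."""
    W = nipals_weights(X, n_components=n_components)
    km = MiniBatchKMeans(n_clusters=n_select, n_init=10, random_state=0).fit(W)
    selected = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(W[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(d)]))
    return sorted(selected)
```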
Survey of UFS Methods
In the literature on UFS methods, the terms feature selection and feature extraction for unsupervised data are sometimes treated as synonyms. However, they are distinct: feature selection refers to selecting a subset of features from the original space, whereas feature extraction produces features in a transformed space, which do not have a clear physical meaning [3].
Principal component analysis (PCA) is frequently used as a feature extraction algorithm, since it forms linear combinations of all available features. However, the original features do not necessarily contribute equally to the formation of each PC, since some features may be critical while others are irrelevant or insignificant to the overall analysis [3].
Some authors have linked PCs to a subset of the original features by selecting critical variables or eliminating redundant, irrelevant, or insignificant ones. In this sense, the B2 (backward elimination) and B4 (forward selection) algorithms are probably the best-known approaches of this type [15]. The B2 algorithm discards variables that are highly associated with the last PCs, while the B4 algorithm selects variables that are highly associated with the first PCs.
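As a simplified, one-pass illustration of these two criteria (not the exact procedures of [15]), the following sketch keeps, for each leading PC, the variable with the largest absolute loading (B4) and discards, for each trailing PC, the variable with the largest absolute loading (B2); the function name and the one-pass simplification are assumptions for this example.

```python
import numpy as np

def b2_b4_selection(X, n_keep):
    """One-pass approximation of B4 (keep variables tied to the first PCs)
    and B2 (drop variables tied to the last PCs)."""
    Xc = X - X.mean(axis=0)
    eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))  # columns of V are loadings
    V = V[:, np.argsort(eigvals)[::-1]]                    # order PCs by decreasing variance
    p = X.shape[1]
    # B4: for each of the first n_keep PCs, keep the (not yet kept) variable
    # with the largest absolute loading
    b4_keep = []
    for k in range(n_keep):
        ranked = np.argsort(-np.abs(V[:, k]))
        b4_keep.append(int(next(j for j in ranked if j not in b4_keep)))
    # B2: for each of the last p - n_keep PCs, discard the (not yet dropped)
    # variable with the largest absolute loading
    b2_drop = []
    for k in range(p - 1, n_keep - 1, -1):
        ranked = np.argsort(-np.abs(V[:, k]))
        b2_drop.append(int(next(j for j in ranked if j not in b2_drop)))
    b2_keep = [j for j in range(p) if j not in b2_drop]
    return sorted(b4_keep), sorted(b2_keep)
```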
In [3], two fundamental problems are addressed: the first is finding a UFS algorithm with low computational time, and the second is giving interpretability to PCA. To this end, the author evaluates features by their ability to reproduce the projections onto the principal axes, using ordinary least squares (OLS) regression combined with forward selection and backward elimination of features.
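A minimal sketch of this evaluation criterion is given below, assuming a plain greedy forward search (the procedure in [3] also includes a backward elimination variant and its own stopping criterion, as noted above): candidate subsets are scored by how well an OLS fit from the selected columns reproduces the projections of the data onto the first principal axes.

```python
import numpy as np

def forward_select_by_pc_reconstruction(X, n_select, n_components=2):
    """Greedily add the feature whose inclusion best reproduces the PC
    scores through an ordinary least squares fit."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]        # projections on the principal axes
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_err = None, np.inf
        for j in remaining:
            cols = selected + [j]
            B, *_ = np.linalg.lstsq(Xc[:, cols], T, rcond=None)  # OLS fit of the scores
            err = np.linalg.norm(T - Xc[:, cols] @ B)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```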
A new hybrid approach to PC-based unsupervised gene selection is developed in [16], which is divided into two steps: the first retrieves subsets of genes with their original physical significance based on their ability to reproduce the sample projections onto the principal components, evaluated through an OLS estimation, and the second searches for the subsets of genes that maximize clustering performance. In [7], a moving range threshold is proposed, based on quality control and control charts, to better identify the significant variables of a dataset.
In [17], the eigenvectors are used to identify critical original features, applying the nearest-neighbor rule to find the predominant ones.
In [18], a method called principal feature analysis (PFA) is proposed. This method computes the eigenvalues and eigenvectors of the covariance matrix, groups the rows of the eigenvector matrix using k-means, and returns the feature closest to each cluster center.
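A compact sketch of this scheme, with illustrative parameter choices, is shown below: each row of the leading eigenvector matrix describes one feature in the space of the PCs, the rows are clustered with k-means, and the feature closest to each cluster center is retained.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_feature_analysis(X, n_features, n_components=None):
    """PFA-style selection: cluster the rows of the leading eigenvector
    matrix and keep the feature closest to each cluster center."""
    Xc = X - X.mean(axis=0)
    eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = V[:, np.argsort(eigvals)[::-1]]            # eigenvectors by decreasing variance
    q = n_components if n_components is not None else n_features
    A = V[:, :q]                                   # row i represents feature i
    km = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit(A)
    selected = []
    for c in range(n_features):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(A[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(d)]))
    return sorted(selected)
```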
Later, a convex sparse principal component algorithm (CSPCA), applied to feature learning, was proposed in [19]. The authors show that PCA can be formulated as a low-rank regression optimization problem that is robust to outliers. They also note that the importance of each feature can be analyzed effectively according to the PCA criterion, since the new objective function is convex, and they propose an iterative algorithm to optimize it, generating a weight for each original feature.
An improved unsupervised difference algorithm called sparse difference embedding (SDE), designed to reduce the dimensionality of data with many features and few observations, was developed in [20]. SDE seeks a set of projections that not only preserves the intraclass locality and maximizes the interclass globality but also simultaneously applies lasso regression to obtain a sparse transformation matrix.
An algorithm coupling a PCA regularization with a sparse feature selection model to preserve the maximum variance of the data is proposed in [21], where an iterative minimization scheme is designed to optimize the objective function, seeking both a good fit and the selection of informative features.
Another type of feature selection algorithm, called globally sparse probabilistic PCA (GSPPCA), was introduced in [22]. It is based on a Bayesian procedure that obtains several sparse components sharing the same sparsity pattern in order to identify relevant features. GSPPCA achieves this by using Roweis's probabilistic interpretation of PCA and isotropic Gaussian priors on the loading matrix, which allows the marginal likelihood of a Bayesian PCA model to be computed.
However, other types of unsupervised feature selection methods exist that rely on filters rather than on principal component analysis. As noted in [23], filter-based unsupervised feature selection methods have received more attention due to their efficiency, scalability, and simplicity. The most relevant recent works are summarized as follows:
In [24], a method called FSFS (feature selection using feature similarity) is proposed, whose primary objective is to reduce the redundancy among the features of a dataset by measuring the dependency or statistical similarity between them using the variances and covariances of the features, which are then grouped into clusters; features with similar properties end up in the same cluster. Feature selection is then performed iteratively using the kNN (k-nearest neighbors) principle: at each iteration, FSFS selects one feature from each cluster to build the final subset of features.
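The following simplified sketch conveys the idea; it substitutes 1 − |correlation| for the similarity measure used in [24] and keeps k fixed rather than adapting it, so it should be read as an approximation of FSFS rather than the published algorithm: at each step, the feature whose k-th nearest neighbor is closest is kept as a cluster representative and its k nearest neighbors are discarded.

```python
import numpy as np

def fsfs_like_selection(X, k=2):
    """FSFS-style redundancy reduction with a correlation-based
    feature dissimilarity (1 - |corr|) and a fixed neighborhood size k."""
    p = X.shape[1]
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))  # feature-to-feature dissimilarity
    np.fill_diagonal(D, np.inf)                     # a feature is not its own neighbor
    active, selected = set(range(p)), []
    while active:
        k_eff = min(k, len(active) - 1)
        if k_eff <= 0:                              # nothing left to merge
            selected.extend(sorted(active))
            break
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        kth = np.sort(sub, axis=1)[:, k_eff - 1]    # distance to the k-th nearest active neighbor
        center = idx[int(np.argmin(kth))]           # most compact cluster representative
        neighbors = np.array(idx)[np.argsort(sub[idx.index(center)])[:k_eff]]
        selected.append(center)
        active -= set(neighbors.tolist()) | {center}
    return sorted(selected)
```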
In [25], a method called MCFS (multicluster feature selection) is proposed, with the objective of selecting subsets of features from a dataset while preserving its cluster structure. MCFS uses spectral analysis and regression with an ℓ1 norm to measure the importance of the features, taking the local data structure into account, and then selects a subset of features based on the regression coefficients. The final objective is to maximize how well the selected subset of features preserves the intrinsic structure of the data.
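A simplified sketch of this scheme is shown below, using a spectral embedding of a kNN graph and LassoLars as the ℓ1-regularized regressor in place of the exact solver used in [25]; the regularization strength, the number of neighbors, and the number of embedding dimensions are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import spectral_embedding
from sklearn.linear_model import LassoLars

def mcfs_like_scores(X, n_clusters=3, n_neighbors=5, alpha=0.01):
    """MCFS-style scoring: spectral embedding to capture the cluster
    structure, then an L1-regularized regression from the features to each
    embedding dimension; each feature is scored by its largest absolute
    coefficient across dimensions."""
    W = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=False)
    W = 0.5 * (W + W.T)                              # symmetrize the kNN affinity graph
    Y = spectral_embedding(W, n_components=n_clusters, random_state=0)
    coefs = np.zeros((X.shape[1], n_clusters))
    for k in range(n_clusters):
        coefs[:, k] = LassoLars(alpha=alpha).fit(X, Y[:, k]).coef_
    return np.abs(coefs).max(axis=1)

# usage: keep the d highest-scoring features
# scores = mcfs_like_scores(X); selected = np.argsort(-scores)[:d]
```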
In [26], a method called UDFS (unsupervised discriminative feature selection) is proposed, with the objective of selecting a subset of features that discriminates among the data for subsequent regression and classification tasks. The algorithm uses discriminative information together with feature correlations, encoded through sparse matrices, to assign a weight to each feature and select the most relevant ones. This improves the efficiency and accuracy of machine learning models by reducing the number of features while retaining the most informative ones.
In [27], a method called NDFS (nonnegative discriminative feature selection) is presented, a feature selection technique that combines several steps. First, spectral analysis is used to learn pseudoclass labels, which represent the relations between the data and their features. Then, a regression model with ℓ2,1-norm regularization is optimized using a dedicated solver. Finally, the p features that relate best to the previously learned pseudoclass labels are selected. This process allows the authors to identify a set of relevant and discriminative features for solving classification problems.
In [28], a method called DSRMR (dual self-representation and manifold regularization) is proposed. It is a learning algorithm based on a self-representation of the features, whose objective is to select an optimal set of relevant features for a learning problem. DSRMR achieves this by considering three fundamental aspects: a representation of the features regularized with an ℓ2,1 norm, the local geometrical structure of the original data, and a weight for each feature reflecting its relative importance. The optimization is performed efficiently by an iterative algorithm, and the final selection is based on the resulting feature weights.
In [29], an innovative feature selection method is proposed based on sample correlations and the dependencies between features. The model has two main goals: to preserve the global and local structure of the dataset and to exploit both global and local information. This is achieved by using mutual information and learning techniques, such as dynamic graphs and low-dimensionality constraints, to obtain reliable information about the correlations among the feature samples.
Finally, it is worth mentioning that in [30], a method using two convolutional neural networks (CNNs) to extract the principal features of different datasets was proposed, obtaining excellent results in comparison with other state-of-the-art methods. This method was tested on IoT devices to analyze signals and classify them as malicious or normal.
The algorithms mentioned above either take the PCs as the core for selecting relevant features (or for giving interpretability to the PCs) or rely on filter- and learning-based criteria. However, none of them addresses the problem of missing data, so the datasets must be preprocessed to correct for it beforehand.