There are several situations in which feature selection is preferred over feature extraction: when interpretability of the features is required to understand the relationships between variables; when computational efficiency matters, since feature selection allows faster computing times; to reduce overfitting when the available datasets are small; when domain knowledge identifies important features that should be selected first; and, finally, when features are highly correlated, since feature selection can identify the most relevant ones and thereby reduce dimensionality.
Since unsupervised data in the real world may contain irrelevant or redundant features that can affect the analysis and lead to biases or incorrect results [1], it becomes crucial to eliminate unnecessary features to improve computational efficiency and to enhance the robustness of simple models on small datasets, owing to their lower variance with respect to sample details [1,2]. By explaining the data with fewer features, we gain a deeper understanding of the underlying process, enabling the extraction of valuable knowledge [2]. This becomes especially relevant when seeking to understand the behavior of specific systems or phenomena [8].
Different strategies have been proposed to deal with missing data, the two most common being imputation and deletion.
This study proposes a novel UFS method based on an iterative nonlinear estimation of components by partial least squares, in which feature weights are calculated for each component and then clustered using a minibatch approach to reduce computation time. The minibatches are subsets of the weights of each feature with respect to the components; the procedure ultimately delivers clusters of similar features and selects a representative of each cluster. Our feature selection method has a low computational load while preserving most of the original variance, and it works with missing data without the need for imputation or deletion.
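As a rough illustration of this pipeline (an illustrative sketch, not the authors' actual implementation), the following Python code computes NIPALS-style component weights while skipping missing entries and then clusters the per-feature weights with minibatch k-means, keeping the feature closest to each cluster center; the function names, the number of components, and the number of clusters are assumptions made here for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def nipals_weights(X, n_components=3, max_iter=100, tol=1e-6):
    """NIPALS-style iteration that ignores missing entries (NaN) and
    returns an (n_features, n_components) matrix of feature weights."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)
    Xc = np.where(mask, X - np.nanmean(X, axis=0), 0.0)     # center, zero-fill missing
    W = np.zeros((X.shape[1], n_components))
    for k in range(n_components):
        t = Xc[:, np.argmax((Xc ** 2).sum(axis=0))].copy()  # start from the most energetic column
        for _ in range(max_iter):
            # weights: regress each observed column of Xc on the score vector t
            w = (Xc * t[:, None]).sum(axis=0) / np.maximum((mask * t[:, None] ** 2).sum(axis=0), 1e-12)
            w /= np.linalg.norm(w) + 1e-12
            # scores: regress each observed row of Xc on the weight vector w
            t_new = (Xc * w[None, :]).sum(axis=1) / np.maximum((mask * w[None, :] ** 2).sum(axis=1), 1e-12)
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:
                break
        W[:, k] = w
        Xc = Xc - np.outer(t, w) * mask                     # deflate observed entries only
    return W

def select_representatives(X, n_select=5, n_components=3):
    """Cluster the feature weights with minibatch k-means and keep the
    feature closest to each cluster center."""
    W = nipals_weights(X, n_components=n_components)
    km = MiniBatchKMeans(n_clusters=n_select, n_init=10, random_state=0).fit(W)
    selected = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(W[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(d)]))
    return sorted(selected)
```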
Survey of UFS Methods
In the literature on UFS methods, the terms feature selection and feature extraction for unsupervised data are sometimes treated as synonyms. However, they are distinct: feature selection refers to selecting a subset of features from the original space, whereas feature extraction produces features in a transformed space, which do not have a clear physical meaning [3].
Principal component analysis (PCA) is frequently used as a feature extraction algorithm, since it forms linear combinations of all available features. However, the original features do not necessarily contribute equally to the formation of each PC, since some features may be critical while others are irrelevant or insignificant to the overall analysis [3].
Some authors have linked PCs to a subset of the original features by selecting critical variables or eliminating redundant, irrelevant, or insignificant ones. In this sense, the B2 (backward elimination) and B4 (forward selection) algorithms are probably the best-known approaches of this type [15]. The B2 algorithm discards variables that are highly associated with the last PCs, while the B4 algorithm selects variables that are highly associated with the first PCs.
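As a simplified, one-pass illustration of these two criteria (not the exact procedures of [15]), the following sketch keeps, for each leading PC, the variable with the largest absolute loading (B4) and discards, for each trailing PC, the variable with the largest absolute loading (B2); the function name and the one-pass simplification are assumptions for this example.

```python
import numpy as np

def b2_b4_selection(X, n_keep):
    """One-pass approximation of B4 (keep variables tied to the first PCs)
    and B2 (drop variables tied to the last PCs)."""
    Xc = X - X.mean(axis=0)
    eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))  # columns of V are loadings
    V = V[:, np.argsort(eigvals)[::-1]]                    # order PCs by decreasing variance
    p = X.shape[1]
    # B4: for each of the first n_keep PCs, keep the (not yet kept) variable
    # with the largest absolute loading
    b4_keep = []
    for k in range(n_keep):
        ranked = np.argsort(-np.abs(V[:, k]))
        b4_keep.append(int(next(j for j in ranked if j not in b4_keep)))
    # B2: for each of the last p - n_keep PCs, discard the (not yet dropped)
    # variable with the largest absolute loading
    b2_drop = []
    for k in range(p - 1, n_keep - 1, -1):
        ranked = np.argsort(-np.abs(V[:, k]))
        b2_drop.append(int(next(j for j in ranked if j not in b2_drop)))
    b2_keep = [j for j in range(p) if j not in b2_drop]
    return sorted(b4_keep), sorted(b2_keep)
```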
In [3], two fundamental problems are addressed: the first is finding a UFS algorithm with low computational time, and the second is giving interpretability to PCA. To this end, the author evaluates features by their ability to reproduce the projections onto the principal axes, using ordinary least squares (OLS) regression combined with forward selection and backward elimination of features.
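A minimal sketch of this evaluation criterion is given below, assuming a plain greedy forward search (the procedure in [3] also includes a backward elimination variant and its own stopping criterion, as noted above): candidate subsets are scored by how well an OLS fit from the selected columns reproduces the projections of the data onto the first principal axes.

```python
import numpy as np

def forward_select_by_pc_reconstruction(X, n_select, n_components=2):
    """Greedily add the feature whose inclusion best reproduces the PC
    scores through an ordinary least squares fit."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]        # projections on the principal axes
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_err = None, np.inf
        for j in remaining:
            cols = selected + [j]
            B, *_ = np.linalg.lstsq(Xc[:, cols], T, rcond=None)  # OLS fit of the scores
            err = np.linalg.norm(T - Xc[:, cols] @ B)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```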
A new hybrid approach to PC-based unsupervised gene selection is developed in [16], which is divided into two steps: the first retrieves subsets of genes with their original physical significance based on their ability to reproduce the sample projections onto the principal components, evaluated through an OLS estimation, and the second searches for the subsets of genes that maximize clustering performance. In [7], a moving range threshold is proposed, based on quality control and control charts, to better identify the significant variables of a dataset.
In [17], the eigenvectors are used to identify critical original features, applying the nearest-neighbor rule to find the predominant ones.
In [18], a method called principal feature analysis (PFA) is proposed. This method computes the eigenvalues and eigenvectors of the covariance matrix, groups the rows of the eigenvector matrix using k-means, and returns the feature closest to each cluster center.
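A compact sketch of this scheme, with illustrative parameter choices, is shown below: each row of the leading eigenvector matrix describes one feature in the space of the PCs, the rows are clustered with k-means, and the feature closest to each cluster center is retained.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_feature_analysis(X, n_features, n_components=None):
    """PFA-style selection: cluster the rows of the leading eigenvector
    matrix and keep the feature closest to each cluster center."""
    Xc = X - X.mean(axis=0)
    eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = V[:, np.argsort(eigvals)[::-1]]            # eigenvectors by decreasing variance
    q = n_components if n_components is not None else n_features
    A = V[:, :q]                                   # row i represents feature i
    km = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit(A)
    selected = []
    for c in range(n_features):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(A[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(d)]))
    return sorted(selected)
```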
Later, a convex sparse principal component algorithm (CSPCA), applied to feature learning, was proposed in [19]. The authors show that PCA can be formulated as a low-rank regression optimization problem that is robust to outliers. They also note that the importance of each feature can be analyzed effectively according to the PCA criterion, since the new objective function is convex, and they propose an iterative algorithm to optimize it, generating a weight for each original feature.
An improved unsupervised difference algorithm called sparse difference embedding (SDE), designed to reduce the dimensionality of data with many features and few observations, was developed in [20]. SDE seeks a set of projections that not only preserves the intraclass locality and maximizes the interclass globality but also simultaneously applies lasso regression to obtain a sparse transformation matrix.
An algorithm coupling a PCA regularization with a sparse feature selection model to preserve the maximum variance of the data is proposed in [21], where an iterative minimization scheme is designed to optimize the objective function, seeking both a good fit and the selection of informative features.
Another type of feature selection algorithm, called globally sparse probabilistic PCA (GSPPCA), was introduced in [22]. It is based on a Bayesian procedure that obtains several sparse components sharing the same sparsity pattern in order to identify relevant features. GSPPCA achieves this by using Roweis's probabilistic interpretation of PCA and isotropic Gaussian priors on the loading matrix, which allows the marginal likelihood of a Bayesian PCA model to be computed.
However, other types of unsupervised feature selection methods exist that rely on filters rather than on principal component analysis. As noted in [23], filter-based unsupervised feature selection methods have received more attention due to their efficiency, scalability, and simplicity. The most relevant recent works are summarized as follows:
In [24], a method called FSFS (feature selection using feature similarity) is proposed, whose primary objective is to reduce the redundancy among the features of a dataset by measuring the dependency or statistical similarity between them using the variances and covariances of the features, which are then grouped into clusters; features with similar properties end up in the same cluster. Feature selection is then performed iteratively using the kNN (k-nearest neighbors) principle: at each iteration, FSFS selects one feature from each cluster to build the final subset of features.
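The following simplified sketch conveys the idea; it substitutes 1 − |correlation| for the similarity measure used in [24] and keeps k fixed rather than adapting it, so it should be read as an approximation of FSFS rather than the published algorithm: at each step, the feature whose k-th nearest neighbor is closest is kept as a cluster representative and its k nearest neighbors are discarded.

```python
import numpy as np

def fsfs_like_selection(X, k=2):
    """FSFS-style redundancy reduction with a correlation-based
    feature dissimilarity (1 - |corr|) and a fixed neighborhood size k."""
    p = X.shape[1]
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))  # feature-to-feature dissimilarity
    np.fill_diagonal(D, np.inf)                     # a feature is not its own neighbor
    active, selected = set(range(p)), []
    while active:
        k_eff = min(k, len(active) - 1)
        if k_eff <= 0:                              # nothing left to merge
            selected.extend(sorted(active))
            break
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        kth = np.sort(sub, axis=1)[:, k_eff - 1]    # distance to the k-th nearest active neighbor
        center = idx[int(np.argmin(kth))]           # most compact cluster representative
        neighbors = np.array(idx)[np.argsort(sub[idx.index(center)])[:k_eff]]
        selected.append(center)
        active -= set(neighbors.tolist()) | {center}
    return sorted(selected)
```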
In [25], a method called MCFS (multicluster feature selection) is proposed, with the objective of selecting subsets of features from a dataset while preserving its cluster structure. MCFS uses spectral analysis and regression with an ℓ1 norm to measure the importance of the features, taking the local data structure into account, and then selects a subset of features based on the regression coefficients. The final objective is to maximize how well the selected subset of features preserves the intrinsic structure of the data.
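A simplified sketch of this scheme is shown below, using a spectral embedding of a kNN graph and LassoLars as the ℓ1-regularized regressor in place of the exact solver used in [25]; the regularization strength, the number of neighbors, and the number of embedding dimensions are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import spectral_embedding
from sklearn.linear_model import LassoLars

def mcfs_like_scores(X, n_clusters=3, n_neighbors=5, alpha=0.01):
    """MCFS-style scoring: spectral embedding to capture the cluster
    structure, then an L1-regularized regression from the features to each
    embedding dimension; each feature is scored by its largest absolute
    coefficient across dimensions."""
    W = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=False)
    W = 0.5 * (W + W.T)                              # symmetrize the kNN affinity graph
    Y = spectral_embedding(W, n_components=n_clusters, random_state=0)
    coefs = np.zeros((X.shape[1], n_clusters))
    for k in range(n_clusters):
        coefs[:, k] = LassoLars(alpha=alpha).fit(X, Y[:, k]).coef_
    return np.abs(coefs).max(axis=1)

# usage: keep the d highest-scoring features
# scores = mcfs_like_scores(X); selected = np.argsort(-scores)[:d]
```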
In [26], a method called UDFS (unsupervised discriminative feature selection) is proposed, with the objective of selecting a subset of features that discriminates among the data for subsequent regression and classification tasks. The algorithm uses discriminative information together with feature correlations, encoded through sparse matrices, to assign a weight to each feature and select the most relevant ones. This improves the efficiency and accuracy of machine learning models by reducing the number of features while retaining the most informative ones.
In [27], a method called NDFS (nonnegative discriminative feature selection) is presented, a feature selection technique that combines several steps. First, spectral analysis is used to learn pseudoclass labels, which represent the relations between the data and their features. Then, a regression model with ℓ2,1-norm regularization is optimized using a dedicated solver. Finally, the p features that relate best to the previously learned pseudoclass labels are selected. This process allows the authors to identify a set of relevant and discriminative features for solving classification problems.
In [28], a method called DSRMR (dual self-representation and manifold regularization) is proposed. It is a learning algorithm based on a self-representation of the features, whose objective is to select an optimal set of relevant features for a learning problem. DSRMR achieves this by considering three fundamental aspects: a representation of the features regularized with an ℓ2,1 norm, the local geometrical structure of the original data, and a weight for each feature reflecting its relative importance. The optimization is performed efficiently by an iterative algorithm, and the final selection is based on the resulting feature weights.
In [29], an innovative feature selection method is proposed based on sample correlations and the dependencies between features. The model has two main goals: to preserve the global and local structure of the dataset and to exploit both global and local information. This is achieved by using mutual information and learning techniques, such as dynamic graphs and low-dimensionality constraints, to obtain reliable information about the correlations among the feature samples.
Finally, it is worth mentioning that in [30], a method using two convolutional neural networks (CNNs) to extract the principal features of different datasets was proposed, obtaining excellent results in comparison with other state-of-the-art methods. This method was tested on IoT devices to analyze signals and classify them as malicious or normal.
The algorithms mentioned above either take the PCs as the core for selecting relevant features (or for giving interpretability to the PCs) or rely on filter- and learning-based criteria. However, none of them addresses the problem of missing data, so the datasets must be preprocessed to correct for it beforehand.