1. Introduction
In the ever-expanding landscape of data-driven applications, unsupervised learning techniques play a pivotal role in extracting meaningful patterns from raw data. Clustering, one of the fundamental tasks in unsupervised learning, seeks to group similar data points together while maintaining separation between distinct clusters [1]. Numerous studies have proposed a variety of clustering approaches, including kernel methods such as the support vector machine (SVM) [2], self-organizing maps (SOM) [3], and k-means clustering [4]. However, achieving optimal clustering results remains a challenging endeavor due to factors such as noisy features, high dimensionality, and the need for robust initialization.
Unsupervised feature selection (UFS), on the other hand, aims to identify the most relevant subset of features from the original feature space [5]. Selecting informative features reduces computational complexity and can also enhance the quality of clustering results. Based on their evaluation criteria, current unsupervised feature-selection studies can be categorized into two primary groups: wrapper-based and filter-based methods [6]. The evaluation criterion for wrapper-based techniques is the classification performance of the chosen features, whereas the assessment criterion in filter techniques is independent of the learning algorithm. Filter approaches employ a variety of metrics, including distance measures [7], consistency measures [8], correlation measures [9], and information-theoretic measures [10]. Wrapper methods generally outperform filter methods because they evaluate the selected features directly on a learning algorithm, although filter methods are usually less computationally expensive [11]. However, these selection techniques continue to face issues with high computational time and convergence to local optima [12]. In addition, traditional unsupervised feature-selection methods often operate independently of the clustering algorithm, overlooking the inherent synergy between feature selection and the clustering process [13]. Metaheuristic techniques have recently been adopted with increasing frequency because their robust global-search capabilities help overcome these shortcomings, especially as the number of features grows. Classic metaheuristic algorithms widely applied to unsupervised feature-selection and clustering problems include the genetic algorithm (GA) [14,15,16], particle swarm optimization (PSO) [17,18,19], and harmony search algorithms, among others [20,21].
The SOS technique, first introduced in [22], is a stochastic metaheuristic approach that uses randomization to generate a collection of solutions. Modeled on the interactions between species in an ecosystem, the SOS algorithm was designed to converge faster and be more robust than classic metaheuristic algorithms [23]. Compared with other population-based metaheuristics that search for near-optimal solutions by iteratively guiding a set of candidate solutions using population characteristics, such as the ant colony optimization (ACO) algorithm, the SOS algorithm offers three key advantages. First, the mutualism and commensalism phases concentrate on creating new organisms, which allows the algorithm to discover diverse solutions and makes it more adept at exploration. Second, the parasitism phase improves exploitation by preventing the algorithm from becoming trapped in local optima. Third, the SOS algorithm has only two general parameters: the maximum number of iterations and the population size. Because of these benefits, the SOS algorithm is widely used and has been adapted to a variety of optimization problems across a number of industries. Recently, to enhance its performance, modified [24] and hybrid [25] versions of the SOS algorithm have been developed as alternatives to the initial algorithm proposed in [22]. Ref. [20] addressed the supervised feature-selection problem on 19 datasets from the UCI repository using a binary version of the SOS algorithm; the results indicated that, for the majority of datasets, the binary SOS algorithm can achieve high classification accuracy with few features. The SOS algorithm has also been used to solve multi-objective optimization problems. A multi-objective symbiotic organism search technique based on the weighted-sum method was proposed in [26] as a supervised learning method for economic/emission dispatch problems in power systems; it was found to outperform other optimization algorithms such as the genetic algorithm (GA), differential evolution (DE), particle swarm optimization (PSO), the bees algorithm (BA), the mine blast algorithm (MBA), ant colony optimization (ACO), and cuckoo search (CS).
The application of SOS algorithms has since increased, particularly in the engineering field [27]. Although unsupervised learning can improve computational efficiency and retrieval recall, very few studies in the literature specifically address unsupervised learning problems such as feature selection and clustering [27,28]. Previous studies concentrated on optimal feature selection for brain–computer interfaces [26] and satellite image classification [29]. Within the literature, Refs. [30,31] explored text clustering and feature selection utilizing the SOS method. In their empirical research, Ref. [32] addressed text classification problems, and [31] proposed an SOS-based approach for feature extraction. Although the literature contains a larger proportion of works on single-objective approaches than on multi-objective optimization methods, multi-objective optimization methods for FS problems based on metaheuristic techniques have not been sufficiently examined [11]. Non-dominated sorting GA II (NSGA-II) or its variants form the basis of the majority of multi-objective techniques [33,34,35,36]. Other evolutionary computation (EC) approaches used in multi-objective feature selection include DE [37], ACO [38], and PSO [39]. According to the results of all these studies, multi-objective optimization algorithms outperform single-objective techniques in terms of both the number of features needed for supervised learning and classification performance. However, the existing literature has predominantly focused on datasets of modest to intermediate scale, indicating that the multi-objective feature-selection problem remains an unexplored field of study for unsupervised learning, much like high-dimensional data clustering. In addition, given that the multi-objective evolutionary computation algorithms applied to the FS problem are based on conventional algorithms such as ACO, PSO, and GA, which typically suffer from slow convergence, high computational complexity, and trapping in local optima, there is also a need to investigate a novel multi-objective algorithm's capability to handle feature-selection problems [33,40].
Although feature selection has been studied extensively, the literature review indicates that multi-objective unsupervised learning for the two problems of unsupervised feature selection and clustering has received relatively little attention. Furthermore, existing multi-objective research faces many of the aforementioned issues and has not addressed large-scale datasets such as the TDT TREC data. Considering the shortcomings of the existing literature and the benefits of the SOS method, this study proposes a multi-objective algorithm with a wrapper-based approach for data clustering. To the best of our knowledge, this study is the first to employ a multi-objective SOS algorithm to find the unsupervised feature combination that maximizes clustering performance while minimizing the number of selected features for a given dataset. To demonstrate the proposed method's robustness and reliability, it is evaluated on popular benchmark datasets. The results are compared with current approaches on both datasets, and the contribution in terms of solution quality is reported. The results of the study demonstrate that the proposed method performs better in its capacity to deliver acceptable outcomes, namely improved clustering performance with a reduced number of selected features. The better results it yields on both datasets attest to its robustness. This work also examines and applies several SOS algorithm variants, compares their results, and identifies their respective benefits and drawbacks.
The rest of the paper is organized as follows: Section 2 presents a review of related works, covering the background of the SOS algorithm, global-search unsupervised feature-selection algorithms based on SOS methods, and the clustering algorithm utilizing SOS algorithms. Section 3 outlines the proposed methods for this study. Section 4 and Section 5 detail the experimental settings and results, respectively. Finally, the conclusion of the work is provided in Section 6.
3. Proposed Method
Unsupervised feature selection plays a crucial role in mitigating the curse of dimensionality, particularly in tasks like document clustering where high-dimensional text data are involved. The role of feature selection is multifaceted: it enhances performance (e.g., accuracy), aids data visualization, simplifies model selection, and reduces dimensionality by eliminating noise and unnecessary attributes [61,62].
In this study, a symbiotic organism search (SOS) algorithm was developed to solve numerical optimization over a continuous search space. Like other population-based methods, the proposed SOS algorithm iteratively employs a population of candidate solutions within the search space to find the global optimum. SOS starts with an initial population (referred to as the ecosystem) in which organisms are randomly generated. Each organism represents a candidate solution, and its associated fitness value reflects its adaptation to the desired objective. The approach models the ecological interaction between two organisms in the ecosystem to control the production of new solutions. The algorithm comprises three phases, namely mutualism, commensalism, and parasitism, which resemble the real-world biological interaction framework.

The nature of the interaction determines the primary principle of each phase. In the mutualism phase, interactions benefit both sides. In the commensalism phase, one side benefits without affecting the other. In the parasitism phase, one side benefits while actively harming the other. Throughout all phases, interactions between organisms are random and continue until the termination conditions are met. The three phases are described in detail in the following subsections, and a sketch tying them together into the full loop is given at the end of Section 3.3.
3.1. Mutualism Phase
An illustrative example of mutualism, which benefits both participating organisms, is the symbiotic relationship between bees and flowers. Bees actively fly among flowers, collecting nectar that they transform into honey—a process beneficial to the bees themselves. Simultaneously, this activity also benefits the flowers, as bees inadvertently distribute pollen during their foraging, facilitating pollination. In the context of the SOS phase, this mutualistic interaction serves as a model.
In SOS, Xi is an organism corresponding to the ith member of the ecosystem. Another organism, Xj, is then selected randomly from the ecosystem to interact with Xi. Both organisms engage in a mutualistic relationship to increase their mutual survival advantage in the ecosystem. New candidate solutions for Xi and Xj are calculated based on the mutualistic symbiosis between organisms Xi and Xj, which is modeled in Equations (10) and (11):

Xi_new = Xi + rand(0,1) × (Xbest − Mutual_Vector × BF1)   (10)

Xj_new = Xj + rand(0,1) × (Xbest − Mutual_Vector × BF2)   (11)

rand(0,1) in Equations (10) and (11) is a vector of random numbers uniformly distributed in [0, 1].
The roles of BF1 and BF2 are as follows. In the natural world, certain mutualistic connections benefit one organism more than the other: interacting with organism B may be extremely advantageous for organism A, while organism B receives only minimal or insignificant benefit from interacting with organism A. To model this, the benefit factors BF1 and BF2 are randomly assigned a value of 1 or 2. These factors indicate the extent to which each organism benefits from the interaction, that is, whether an organism gains fully or only partially from it.
The relationship characteristic between organisms Xi and Xj is represented by a vector named 'Mutual_Vector', as shown in Equation (12):

Mutual_Vector = (Xi + Xj)/2   (12)

The (Xbest − Mutual_Vector × BF1) component of the equation reflects the mutualistic effort to achieve the objective of enhancing survival advantage. Moreover, based on Darwin's theory of evolution, which states that 'only the fittest organisms will prevail', all species are compelled to increase their degree of adaptation to their ecosystem, and some of them enhance their adaptation by forming symbiotic relationships with other organisms. Since Xbest represents the maximum level of adaptation, it serves as the objective point for the fitness increment of both organisms. In the end, each organism is updated only if its new fitness is better than its fitness before the interaction.
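To make the update concrete, the following is a minimal NumPy sketch of the mutualism phase for a cost function that is minimized; the function name, array conventions, and greedy-selection logic are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def mutualism(x_i, x_j, x_best, cost, rng):
    """Mutualism phase of SOS (Equations (10)-(12)), minimization setting.

    x_i, x_j : 1-D NumPy arrays, the two interacting organisms.
    x_best   : the fittest organism currently in the ecosystem.
    cost     : objective function; lower values mean higher fitness.
    """
    bf1 = rng.integers(1, 3)                  # benefit factor BF1 in {1, 2}
    bf2 = rng.integers(1, 3)                  # benefit factor BF2 in {1, 2}
    mutual = (x_i + x_j) / 2.0                # Mutual_Vector, Equation (12)
    x_i_new = x_i + rng.random(x_i.size) * (x_best - mutual * bf1)  # Eq. (10)
    x_j_new = x_j + rng.random(x_j.size) * (x_best - mutual * bf2)  # Eq. (11)
    # Greedy selection: keep a new organism only if it improves on the old one.
    if cost(x_i_new) < cost(x_i):
        x_i = x_i_new
    if cost(x_j_new) < cost(x_j):
        x_j = x_j_new
    return x_i, x_j
```

The greedy comparison at the end corresponds to the rule that organisms are updated only when the interaction improves their fitness.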
3.2. Commensalism Phase
A common example used to define commensalism is the interaction between remora fish and sharks. In this relationship, the remora gains an advantage by clinging to the shark and consuming its leftover food, while the behavior of the remora does not affect the shark, and the relationship offers very little benefit to it.
As in the mutualism phase, an organism Xj is chosen at random from the ecosystem to interact with Xi. In this situation, organism Xi attempts to gain a benefit from the exchange, whereas the relationship neither benefits nor harms organism Xj. The commensal symbiosis between organisms Xi and Xj, described in Equation (13), is used to determine the new candidate solution for Xi:

Xi_new = Xi + rand(−1,1) × (Xbest − Xj)   (13)

By the rules, organism Xi is updated only if its new fitness is better than its fitness before the interaction. The (Xbest − Xj) term in the equation reflects the beneficial advantage provided by Xj, which helps Xi increase its survival advantage in the ecosystem toward the highest degree of adaptation currently attained (represented by Xbest).
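A matching sketch of the commensalism update follows; the rand(−1, 1) scaling follows the original SOS formulation, and the names are again illustrative.

```python
import numpy as np

def commensalism(x_i, x_j, x_best, cost, rng):
    """Commensalism phase of SOS (Equation (13)): X_i may benefit, X_j is unchanged."""
    r = rng.uniform(-1.0, 1.0, x_i.size)        # rand(-1, 1) vector
    x_i_new = x_i + r * (x_best - x_j)          # benefit drawn from (Xbest - Xj)
    # Greedy selection: X_i is updated only if the new solution is fitter.
    return x_i_new if cost(x_i_new) < cost(x_i) else x_i
```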
3.3. Parasitism Phase
The Plasmodium parasite, which spreads between human hosts through its relationship with the Anopheles mosquito, is a good example of parasitism. While the parasites grow and replicate in the human body, their human host may contract malaria and die as a result. SOS gives organism Xi a role similar to that of the Anopheles mosquito by creating an artificial parasite known as the 'Parasite_Vector'. The Parasite_Vector is formed by duplicating organism Xi in the search space and then modifying randomly chosen dimensions with random values. An organism Xj, selected at random from the ecosystem, serves as the host for the Parasite_Vector, which attempts to replace Xj in the ecosystem. The fitness of both is then evaluated.

If the Parasite_Vector has a higher fitness value than organism Xj, Xj is eliminated and the Parasite_Vector takes its place in the ecosystem. Conversely, if the fitness value of Xj is higher, Xj is immune to the parasite, and the Parasite_Vector can no longer survive in that ecosystem.
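The sketch below implements the parasitism phase and then ties the three phase functions sketched above into the overall SOS loop described at the start of Section 3; the population size, bounds, and the mutation rate used to build the Parasite_Vector are illustrative assumptions.

```python
import numpy as np

def parasitism(x_i, x_j, cost, lower, upper, rng):
    """Parasitism phase: a mutated copy of X_i (the Parasite_Vector) tries to displace X_j."""
    parasite = x_i.copy()
    mask = rng.random(x_i.size) < 0.5                     # randomly chosen dimensions
    parasite[mask] = rng.uniform(lower, upper, mask.sum())
    # The host X_j survives only if it is fitter than the parasite.
    return parasite if cost(parasite) < cost(x_j) else x_j

def sos(cost, dim, pop_size=30, max_iter=200, lower=-5.0, upper=5.0, seed=0):
    """SOS main loop using the phase sketches above (minimization)."""
    rng = np.random.default_rng(seed)
    eco = rng.uniform(lower, upper, (pop_size, dim))      # random initial ecosystem
    for _ in range(max_iter):
        for i in range(pop_size):
            best = min(eco, key=cost)                     # fittest organism so far
            j = rng.choice([k for k in range(pop_size) if k != i])
            eco[i], eco[j] = mutualism(eco[i], eco[j], best, cost, rng)
            j = rng.choice([k for k in range(pop_size) if k != i])
            eco[i] = commensalism(eco[i], eco[j], best, cost, rng)
            j = rng.choice([k for k in range(pop_size) if k != i])
            eco[j] = parasitism(eco[i], eco[j], cost, lower, upper, rng)
    return min(eco, key=cost)

# Toy usage: minimize the sphere function in 5 dimensions.
print(sos(lambda x: float(np.sum(x ** 2)), dim=5))
```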
3.4. Development of Initial Features
The values of the selected features are organized as an array. In optimization terminology, this array corresponds to the particle position in particle swarm optimization (PSO), while genetic algorithms refer to it as a 'chromosome'. By analogy, the proposed approach labels each individual (candidate feature subset) a 'Raindrop'. In an Nvar-dimensional feature-selection problem, a raindrop is an array of size 1 × Nvar, defined as follows:

Raindrop = [X1, X2, X3, …, XNvar]

At the beginning of feature selection, a candidate matrix of raindrops of size Npop × Nvar is created (i.e., a population of feature raindrops). The matrix X is then generated randomly (its columns and rows correspond to the number of design variables and the number of candidate unsupervised feature selections, respectively). Every decision variable value (X1, X2, X3, …, XNvar) takes one of the values 0 or 1, where Nvar and Npop are the number of design variables and the number of raindrops (preliminary unsupervised feature selections), respectively. Once the Npop raindrops are generated, the cost of each raindrop is obtained by evaluating the cost function (Cost) as follows:

Costi = f(Xi1, Xi2, …, XiNvar),   i = 1, 2, …, Npop
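A minimal sketch of this binary initialization step, assuming Npop candidate subsets ('raindrops') over Nvar features and a caller-supplied cost function (all names are illustrative):

```python
import numpy as np

def init_raindrops(n_pop, n_var, cost, seed=0):
    """Build the initial Npop x Nvar matrix X of 0/1 raindrops and their costs.

    Row i encodes one candidate feature subset: X[i, d] = 1 keeps feature d,
    0 drops it. `cost` maps a binary row to a scalar to be minimized.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_pop, n_var))        # random 0/1 entries
    costs = np.array([cost(row) for row in X])         # Cost_i for each raindrop
    return X, costs

# Toy usage: a cost that simply counts the selected features.
X, costs = init_raindrops(n_pop=5, n_var=8, cost=lambda row: int(row.sum()))
print(X, costs)
```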
3.5. Cost of Solutions
As previously established, each row in eco is associated with a set of features of the document collection. In the context of eco, the set of features of a row is denoted by f = (f1, f2, …, fn). The objective for each row in eco is to assess the mean absolute difference (MAD), as detailed in [61]. MAD is used to determine the most relevant features for text classification by correlating the scores with the importance of each feature; the aim is to assign scores that accurately reflect each feature's significance. One way to obtain such a score is to take the mean absolute difference between the sample values and their mean, as depicted in the following equation:

MADi = (1/n) × Σj |Xij − X̄i|

where Xij is the value of feature i in document j and X̄i is the mean of feature i, computed as follows:

X̄i = (1/n) × Σj Xij
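For illustration, the MAD score can be computed over a document-term matrix as in the short sketch below (rows are documents, columns are features; the function name and toy data are ours):

```python
import numpy as np

def mad_scores(X):
    """Mean absolute difference per feature over n documents:
    MAD_i = (1/n) * sum_j |X_ij - mean_i|. Higher scores suggest
    more informative features; constant features score zero."""
    mean = X.mean(axis=0)                    # per-feature mean, X-bar_i
    return np.abs(X - mean).mean(axis=0)

# Toy usage: the middle feature is constant, so its MAD score is 0.
X = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 2.0],
              [1.0, 0.0, 0.0],
              [3.0, 0.0, 4.0]])
print(mad_scores(X))                         # [1.0, 0.0, 1.0]
```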
Every element in the solution indicates a cluster number, C = (C1, C2, …, CK), and each solution in eco corresponds to a document clustering. The set of K centroids corresponding to a row in eco is represented by C. The centroid of the kth cluster is calculated as the mean of the documents assigned to that cluster:

ck = (1/nk) × Σj djk

where djk is the jth document in cluster k and nk is the number of documents in cluster k.

The goal is for cluster centroids to optimize similarity both within and between clusters by minimizing the distances of documents to their own centroid. Each row is therefore assigned a fitness value corresponding to the average distance of documents to the cluster centroid, from which a suitable solution is derived. This criterion is commonly known as the average distance of documents to the cluster centroid (ADDC):

ADDC = (1/K) × Σi [(1/ni) × Σj D(ci, dij)]

Cosine similarity is denoted by D(·,·), where dij is the jth document in cluster i, K represents the number of clusters, and ni is the number of documents contained in cluster i. If the cost value of the locally optimized vector is better than that of an existing eco solution, the newly generated solution replaces that row in eco.
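Finally, a sketch of the ADDC fitness under the stated assumptions (documents as rows of a matrix, cosine similarity as D(·,·), non-empty clusters; names and toy data are illustrative):

```python
import numpy as np

def addc(docs, labels, k):
    """Average similarity of documents to their cluster centroid (ADDC):
    (1/K) * sum_i (1/n_i) * sum_j D(c_i, d_ij), with cosine similarity D.
    Assumes every cluster 0..k-1 contains at least one document."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    total = 0.0
    for i in range(k):
        members = docs[labels == i]
        centroid = members.mean(axis=0)               # centroid c_i of cluster i
        total += np.mean([cos(centroid, d) for d in members])
    return total / k                                  # higher = tighter clusters

# Toy usage: two well-separated clusters in a 2-term document space.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(addc(docs, np.array([0, 0, 1, 1]), k=2))        # close to 1.0
```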