Similarity-Based Three-Way Clustering by Using Dimensionality Reduction
Abstract
1. Introduction
- (1) Ensemble three-way clustering framework based on dimensionality reduction.
- (2) Integration of co-occurrence frequency, hierarchical clustering, and lifetime analysis.
2. Related Work
2.1. Three-Way Clustering
2.2. PCA Dimensionality Reduction
2.3. K-Means Algorithm
2.4. Hierarchical Clustering
2.5. Clustering Ensemble and Co-Association Frequency
3. Similarity-Based Three-Way Clustering by Using Dimensionality Reduction
3.1. Dimensionality Reduction
Algorithm 1: Foundational Clustering Based on Data Dimensionality Reduction
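The full pseudocode of Algorithm 1 is not reproduced above, so the following is a minimal Python sketch of the step it names, under our reading of Section 3.1: PCA reduces the original matrix, and repeated k-means runs on the reduced data supply the base partitions for the later ensemble stage. The function name `base_clusterings`, the number of runs, and the use of scikit-learn are our assumptions, not the paper's specification.

```python
# A minimal sketch of Algorithm 1 under assumed details: PCA reduces the
# original data matrix, then several k-means runs with different random
# initializations produce the base partitions used by the ensemble stage.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def base_clusterings(X, n_components, n_clusters, n_runs=10, seed=0):
    """Reduce X with PCA, then collect n_runs k-means label vectors."""
    Z = PCA(n_components=n_components).fit_transform(X)
    rng = np.random.RandomState(seed)
    labels = [
        KMeans(n_clusters=n_clusters, n_init=1,
               random_state=rng.randint(2**31 - 1)).fit_predict(Z)
        for _ in range(n_runs)
    ]
    return Z, np.array(labels)  # labels: shape (n_runs, n_samples)
```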
3.2. Clustering Ensemble
Algorithm 2: Ensemble Clustering Results

Input: Reduced data matrix.
Output: Ensemble clustering result.

1. Compute the co-occurrence frequency matrix by Equation (2).
2. Obtain the single-linkage dendrogram of the co-occurrence frequency matrix.
3. Select the clustering result with the highest lifetime in the dendrogram.
4. Return the ensemble clustering result.
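A hedged sketch of Algorithm 2 follows. It assumes that the Equation (2) co-occurrence frequency is the fraction of base partitions that place two samples in the same cluster (the standard co-association definition), converts it to a distance, builds the single-linkage dendrogram with SciPy, and cuts at the level with the longest lifetime, read here as the largest gap between consecutive merge heights.

```python
# Sketch of Algorithm 2: co-occurrence frequencies -> single-linkage
# dendrogram -> cut at the partition with the longest lifetime.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ensemble_by_lifetime(labels):
    n_runs, n = labels.shape
    # M[i, j]: fraction of base partitions placing i and j together
    # (assumed reading of the paper's Equation (2)).
    M = sum((lab[:, None] == lab[None, :]).astype(float) for lab in labels) / n_runs
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")
    heights = Z[:, 2]
    # Lifetime of each dendrogram level = gap to the next merge height;
    # the top level lives until the maximum possible distance 1.0.
    gaps = np.diff(np.concatenate([heights, [1.0]]))
    k = np.argmax(gaps)
    threshold = heights[k] + gaps[k] / 2.0  # cut inside the longest gap
    return fcluster(Z, t=threshold, criterion="distance"), M
```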
3.3. Similar Classes Based on Co-Association Frequency
Algorithm 3: Finding core region and fringe region
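Algorithm 3's pseudocode is likewise abbreviated above; the sketch below shows one plausible reading: within each ensemble cluster, a sample whose average co-occurrence frequency with its fellow members reaches a threshold is placed in the core region, and the remaining members form the fringe region. The threshold value and function names are illustrative assumptions.

```python
# Hedged sketch of Algorithm 3: split each ensemble cluster into core and
# fringe regions by average co-occurrence frequency with cluster members.
import numpy as np

def core_fringe_regions(ensemble_labels, M, threshold=0.75):
    regions = {}
    for c in np.unique(ensemble_labels):
        members = np.flatnonzero(ensemble_labels == c)
        # Mean co-occurrence frequency of each member with the cluster
        # (including the sample itself, whose self-frequency is 1).
        avg = M[np.ix_(members, members)].mean(axis=1)
        regions[c] = {
            "core": members[avg >= threshold],
            "fringe": members[avg < threshold],
        }
    return regions
```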
3.4. Similarity-Based Three-Way Clustering by Using Dimensionality Reduction
Algorithm 4: Similarity-based three-way clustering algorithm

Input: Original data matrix, number of iterations, number of clusters, dimensionality reduction method (PCA), threshold.
Output: Three-way clustering results (core region and fringe region of each cluster).
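Chaining the three sketches gives an end-to-end driver in the spirit of Algorithm 4; the default values of `n_runs` and `threshold` are our placeholders, not values prescribed by the paper.

```python
# End-to-end sketch of Algorithm 4, reusing the sketches for Algorithms 1-3.
# Defaults for n_runs and threshold are illustrative assumptions.
def three_way_clustering(X, n_components, n_clusters, n_runs=10, threshold=0.75):
    Z, labels = base_clusterings(X, n_components, n_clusters, n_runs)
    ensemble_labels, M = ensemble_by_lifetime(labels)
    return core_fringe_regions(ensemble_labels, M, threshold)
```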
4. Experimental Analyses
4.1. Data Descriptions
4.2. Evaluation Indices
- (1) Adjusted Rand Index (ARI) is an external metric for comparing a clustering result with a ground-truth partition. It counts pairs of samples grouped consistently in both partitions and corrects the raw Rand index for the agreement expected by chance, so that a random labeling scores close to 0 and a perfect match scores 1.
- (2) Adjusted Mutual Information (AMI) [63,64] is an external metric commonly used to assess the performance of clustering results. It measures the similarity between a clustering result and a ground truth (typically, the actual labels) by the mutual information between the two label assignments, adjusted for the agreement expected by chance.
- (3) Accuracy (ACC) [65] is a common metric for assessing classification performance. In the clustering setting, it measures the proportion of samples that are correctly assigned once clusters are optimally mapped to classes, serving as a simple and intuitive indicator (a computation sketch is given after this list). The formula for calculating ACC is as follows:

  $$\mathrm{ACC} = \max_{m}\ \frac{1}{n}\sum_{i=1}^{n}\delta\big(y_i,\ m(c_i)\big),$$

  where $n$ is the number of samples, $y_i$ and $c_i$ are the true label and predicted cluster of sample $i$, $\delta$ is the Kronecker delta, and the maximum ranges over one-to-one mappings $m$ from clusters to classes.
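The three indices can be computed as follows; ARI and AMI come straight from scikit-learn, while clustering ACC requires the optimal cluster-to-class mapping, implemented here with the Hungarian algorithm (a common choice, assumed rather than taken from the paper).

```python
# Evaluation sketch: ARI and AMI from scikit-learn; ACC via the best
# cluster-to-class mapping found with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def clustering_acc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # cost[i, j] = negated number of samples in cluster i with true class j,
    # so that minimizing cost maximizes the number of correct assignments.
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return {"ARI": adjusted_rand_score(y_true, y_pred),
            "AMI": adjusted_mutual_info_score(y_true, y_pred),
            "ACC": clustering_acc(y_true, y_pred)}
```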
4.3. Experimental Performances
| Datasets | Algorithm | ARI | AMI | ACC |
|---|---|---|---|---|
| Seeds | K-means | 0.7500 | 0.7054 | 0.9095 |
| | FCM | 0.7161 | 0.6915 | 0.8952 |
| | DBSCAN | 0.7021 | 0.4396 | 0.3667 |
| | Ours | 0.8198 | 0.7685 | 0.9356 |
| Credit | K-means | 0.0091 | 0.0320 | 0.3741 |
| | FCM | 0.0272 | 0.0317 | 0.3917 |
| | DBSCAN | 0.0110 | 0.0006 | 0.3890 |
| | Ours | 0.1116 | 0.1073 | 0.3987 |
| Ionosphere | K-means | 0.0110 | 0.0006 | 0.5783 |
| | FCM | 0.1713 | 0.1272 | 0.7094 |
| | DBSCAN | 0.2174 | 0.1426 | 0.3932 |
| | Ours | 0.2500 | 0.2017 | 0.7833 |
| Libras | K-means | 0.1837 | 0.1842 | 0.3389 |
| | FCM | 0.0597 | 0.3300 | 0.1778 |
| | DBSCAN | 0.0025 | 0.2215 | 0.1000 |
| | Ours | 0.5193 | 0.7144 | 0.6182 |
| Ecoli | K-means | 0.4542 | 0.5709 | 0.5565 |
| | FCM | 0.3679 | 0.5619 | 0.4970 |
| | DBSCAN | 0.0080 | 0.0050 | 0.4256 |
| | Ours | 0.3937 | 0.4999 | 0.6100 |
| Segmentation | K-means | 0.0331 | 0.0736 | 0.2455 |
| | FCM | 0.3875 | 0.5062 | 0.6100 |
| | DBSCAN | 0.1067 | 0.3301 | 0.2939 |
| | Ours | 0.5501 | 0.6996 | 0.6720 |
| Thyroid | K-means | 0.2145 | 0.3911 | 0.5721 |
| | FCM | 0.4294 | 0.1760 | 0.7860 |
| | DBSCAN | 0.3123 | 0.0356 | 0.4465 |
| | Ours | 0.5964 | 0.5628 | 0.8950 |
| Wdbc | K-means | 0.0019 | 0.0052 | 0.5202 |
| | FCM | 0.7299 | 0.6138 | 0.9279 |
| | DBSCAN | 0.0274 | 0.0145 | 0.6098 |
| | Ours | 0.6441 | 0.5295 | 0.9320 |
| Wine | K-means | 0.4483 | 0.4485 | 0.6461 |
| | FCM | 0.3492 | 0.4075 | 0.6854 |
| | DBSCAN | 0.2700 | 0.3137 | 0.5169 |
| | Ours | 0.5831 | 0.6674 | 0.8118 |
- (1) Comparing the proposed three-way clustering algorithm with traditional clustering methods, such as k-means, FCM (Fuzzy C-Means), and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), on ARI, AMI, and ACC shows that our algorithm holds significant advantages on most datasets. Taking the Libras dataset as an example, the proposed algorithm attains ARI, AMI, and ACC values of 0.5193, 0.7144, and 0.6182, respectively, whereas traditional k-means reaches only 0.1837, 0.1842, and 0.3389. This improvement is attributed to the dimensionality reduction of the original high-dimensional data, which maps it to a lower-dimensional space and reduces data complexity. The introduction of co-occurrence frequency enables a more precise delineation of similar classes, allocating data points to core and fringe regions and better capturing the inherent structure of the data.
- (2) Comparing the proposed three-way clustering algorithm with the other algorithms on ARI, AMI, and ACC, we observed significant improvements. Across all datasets, the proposed algorithm achieved average improvements of roughly 20% to 30% in ARI and ACC and about 15% to 35% in AMI. Several factors explain these gains. First, the proposed algorithm adopts an ensemble strategy that integrates dimensionality reduction, co-occurrence frequencies, and similar classes, offering a more comprehensive view of the inherent structure of the data. Second, by leveraging single-linkage hierarchical clustering, it effectively captures the degree of correlation among data points, yielding a more precise assignment of points to clusters. In addition, selecting the clustering result with the highest lifetime as the final merged result ensures the stability and consistency of the outcome, making the method suitable for diverse data types and complex structures. The suboptimal performance on the Wdbc dataset may stem from sensitivity to parameter settings, since suitable parameters can vary across datasets. Although the proposed algorithm shows significant improvements overall, certain algorithms may perform better under specific conditions owing to their inherent characteristics; for example, DBSCAN is particularly effective on datasets with noise and density variations, while hierarchical clustering can capture nested cluster structures. Comparing actual runtimes against the theoretical time complexity, we conclude that the proposed algorithm strikes a balance between accuracy and computational efficiency: although it is not the fastest, its robustness and ability to handle high-dimensional, noisy data make it a valuable tool in practical applications.
5. Conclusions
- (1) Adaptability of parameter selection.
- (2) Improving the quality of base clustering.
- (3) Adaptation improvements for specific datasets.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yang, X.; Yao, Y. Ensemble selector for attribute reduction. Appl. Soft Comput. 2018, 70, 1–11. [Google Scholar] [CrossRef]
- Jiang, Z.; Yang, X.; Yu, H.; Liu, D.; Wang, P.; Qian, Y. Accelerator for multi-granularity attribute reduction. Knowl.-Based Syst. 2019, 177, 145–158. [Google Scholar] [CrossRef]
- Li, J.; Yang, X.; Song, X.; Li, J.; Wang, P.; Yu, D.J. Neighborhood attribute reduction: A multi-criterion approach. Int. J. Mach. Learn. Cybern. 2019, 10, 731–742. [Google Scholar] [CrossRef]
- Liu, K.; Yang, X.; Yu, H.; Fujita, H.; Chen, X.; Liu, D. Supervised information granulation strategy for attribute reduction. Int. J. Mach. Learn. Cybern. 2020, 11, 2149–2163. [Google Scholar] [CrossRef]
- Xu, S.; Yang, X.; Yu, H.; Yu, D.-J.; Yang, J.; Tsang, E.C. Multi-label learning with label-specific feature reduction. Knowl.-Based Syst. 2016, 104, 52–61. [Google Scholar] [CrossRef]
- Liu, K.; Yang, X.; Fujita, H. An efficient selector for multi-granularity attribute reduction. Inf. Sci. 2019, 505, 457–472. [Google Scholar] [CrossRef]
- Liu, K.; Yang, X.; Yu, H.; Mi, J.; Wang, P.; Chen, X. Rough set based semi-supervised feature selection via ensemble selector. Knowl.-Based Syst. 2020, 165, 282–296. [Google Scholar] [CrossRef]
- Xu, M.; Li, C.; Zhang, S.; Le Callet, P. State-of-the-art in 360 video/image processing: Perception, assessment and compression. IEEE J. Sel. Top. Signal Process. 2020, 14, 5–26. [Google Scholar] [CrossRef]
- Tov, O.; Alaluf, Y.; Nitzan, Y.; Patashnik, O.; Cohen-Or, D. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph. 2021, 40, 1–14. [Google Scholar] [CrossRef]
- Yang, X.; Qi, Y.; Song, X.; Yang, J. Test cost sensitive multigranulation rough set: Model and minimal cost selection. Inf. Sci. 2013, 250, 184–199. [Google Scholar] [CrossRef]
- Xu, W.; Guo, Y. Generalized multigranulation double-quantitative decision-theoretic rough set. Knowl.-Based Syst. 2016, 105, 190–205. [Google Scholar] [CrossRef]
- Li, W.; Xu, W.; Zhang, X.; Zhang, J. Updating approximations with dynamic objects based on local multigranulation rough sets in ordered information systems. Artif. Intell. Rev. 2021, 55, 1821–1855. [Google Scholar] [CrossRef]
- Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
- Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
- Coley, C.W.; Jin, W.; Rogers, L.; Jamison, T.F.; Jaakkola, T.S.; Green, W.H.; Barzilay, R.; Jensen, K.F. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 2019, 10, 370–377. [Google Scholar] [CrossRef] [PubMed]
- Strehl, A.; Ghosh, J. Cluster ensembles: A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2003, 3, 583–617. [Google Scholar]
- Fred, A.L.N.; Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 835–850. [Google Scholar] [CrossRef] [PubMed]
- Zhou, Z.; Tang, W. Cluster ensemble. Knowl.-Based Syst. 2006, 19, 77–83. [Google Scholar] [CrossRef]
- Huang, D.; Lai, J.; Wang, C. Ensemble clustering using factor graph. Pattern Recognit. 2016, 50, 131–142. [Google Scholar] [CrossRef]
- Huang, D.; Wang, C.; Lai, J. Locally weighted ensemble clustering. IEEE Trans. Cybern. 2018, 48, 1460–1473. [Google Scholar] [CrossRef]
- Xu, L.; Ding, S. A novel clustering ensemble model based on granular computing. Appl. Intell. 2021, 51, 5474–5488. [Google Scholar] [CrossRef]
- Zhou, P.; Wang, X.; Du, L.; Li, X.J. Clustering ensemble via structured hypergraph learning. Inf. Fusion 2022, 78, 171–178. [Google Scholar] [CrossRef]
- Yao, Y. The superiority of three-way decisions in probabilistic rough set models. Inf. Sci. 2011, 181, 1080–1096. [Google Scholar] [CrossRef]
- Yao, Y. Three-way decisions and cognitive computing. Cogn. Comput. 2016, 8, 543–554. [Google Scholar] [CrossRef]
- Yao, Y. Tri-level thinking: Models of three-way decision. Int. J. Mach. Learn. Cybern. 2020, 11, 947–959. [Google Scholar] [CrossRef]
- Yao, Y. The geometry of three-way decision. Appl. Intell. 2021, 51, 6298–6325. [Google Scholar] [CrossRef]
- Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356. [Google Scholar] [CrossRef]
- Yang, X.; Liang, S.; Yu, H.; Gao, S.; Qian, Y. Pseudo-label neighborhood rough set: Measures and attribute reductions. Int. J. Approx. Reason. 2019, 105, 112–129. [Google Scholar] [CrossRef]
- Dou, H.; Yang, X.; Song, X.; Yu, H.; Wu, W.Z.; Yang, J. Decision-theoretic rough set: A multi-cost strategy. Knowl.-Based Syst. 2016, 91, 71–83. [Google Scholar] [CrossRef]
- Darwiche, A. Bayesian networks. Found. Artif. Intell. 2008, 3, 467–509. [Google Scholar]
- Daly, R.; Shen, Q.; Aitken, S. Learning Bayesian networks: Approaches and issues. Knowl. Eng. Rev. 2011, 26, 99–157. [Google Scholar] [CrossRef]
- Yang, X.; Liu, D.; Yang, X.; Liu, K.; Li, T. Incremental fuzzy probability decision-theoretic approaches to dynamic three-way approximations. Inf. Sci. 2021, 550, 71–90. [Google Scholar] [CrossRef]
- Li, C.; Zhou, J.; Kou, P.; Xiao, J. A novel chaotic particle swarm optimization based fuzzy clustering algorithm. Neurocomputing 2012, 83, 98–109. [Google Scholar] [CrossRef]
- Yu, H. A framework of three-way cluster analysis. In Proceedings of the International Joint Conference on Rough Sets, Olsztyn, Poland, 3–7 July 2017; pp. 300–312. [Google Scholar]
- Wu, T.; Fan, J.; Wang, P. An improved three-way clustering based on ensemble strategy. Mathematics 2022, 10, 1457. [Google Scholar] [CrossRef]
- Wang, P.; Yang, X. Three-way clustering method based on stability theory. IEEE Access 2021, 9, 33944–33953. [Google Scholar] [CrossRef]
- Wang, P.; Shi, H.; Yang, X.; Mi, J. Three-way k-means: Integrating k-means and three-way decision. Int. J. Mach. Learn. Cybern. 2019, 10, 2767–2777. [Google Scholar] [CrossRef]
- Fan, J.; Wang, P.; Jiang, C.; Yang, X.; Song, J. Ensemble learning using three-way density-sensitive spectral clustering. Int. J. Approx. Reason. 2022, 149, 70–84. [Google Scholar] [CrossRef]
- Wang, P.; Yang, X.; Ding, W.; Zhan, J.; Yao, Y. Three-way clustering: Foundations, survey and challenges. Appl. Soft Comput. 2024, 151, 111131. [Google Scholar] [CrossRef]
- Wang, P.; Yao, Y. CE3: A three-way clustering method based on mathematical morphology. Knowl.-Based Syst. 2018, 155, 54–65. [Google Scholar] [CrossRef]
- Li, F.; Qian, Y.; Wang, J.; Dang, C.; Jing, L. Clustering ensemble based on sample’s stability. Artif. Intell. 2019, 273, 37–55. [Google Scholar] [CrossRef]
- Yu, H.; Chang, Z.; Wang, G.; Chen, X. An efficient three-way clustering algorithm based on gravitational search. Int. J. Mach. Learn. Cybern. 2020, 11, 1003–1016. [Google Scholar] [CrossRef]
- Jia, X.; Rao, Y.; Li, W.; Yang, S.; Yu, H. An automatic three-way clustering method based on sample similarity. Int. J. Mach. Learn. Cybern. 2021, 12, 1545–1556. [Google Scholar] [CrossRef]
- Wang, P.; Wu, T.; Yao, Y. A three-way adaptive density peak clustering (3W-ADPC) method. Appl. Intell. 2023, 53, 23966–23982. [Google Scholar] [CrossRef]
- Bruni, V.; Cardinali, M.L.; Vitulano, D. A short review on minimum description length: An application to dimension reduction in PCA. Entropy 2022, 24, 269. [Google Scholar] [CrossRef] [PubMed]
- Goparaju, B.; Rao, S.B. A DDoS attack detection using PCA dimensionality reduction and support vector machine. Int. J. Commun. Netw. Inf. Sec. 2022, 14, 1–8. [Google Scholar] [CrossRef]
- Boushaba, A.; Cauet, S.; Chamroo, A.; Etien, E.; Rambault, L. Comparative study between physics-informed CNN and PCA in induction motor broken bars MCSA Detection. Sensors 2022, 22, 9494. [Google Scholar] [CrossRef] [PubMed]
- von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar]
- Ng, A.; Jordan, M.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 14, 849–856. [Google Scholar]
- Ford, J.K.; MacCallum, R.C.; Tait, M. The application of exploratory factor analysis in applied psychology: A critical review and analysis. Pers. Psychol. 1986, 39, 291–314. [Google Scholar] [CrossRef]
- Torgerson, W.S. Multidimensional scaling: I. Theory and method. Psychometrika 1952, 17, 401–419. [Google Scholar] [CrossRef]
- Yu, H.; Jiao, P.; Yao, Y.Y.; Wang, G.Y. Detecting and refining overlapping regions in complex networks with three-way decisions. Inf. Sci. 2016, 373, 21–41. [Google Scholar] [CrossRef]
- MacQueen, J. Some methods for classification and analysis of multivariate observations. Berkeley Symp. Math. Stat. Probab. 1967, 5, 281–297. [Google Scholar]
- Unrau, R.C.; Krieger, O.; Gamsa, B.; Stumm, M. Hierarchical clustering: A structure for scalable multiprocessor operating system design. J. Supercomput. 1995, 9, 105–134. [Google Scholar] [CrossRef]
- Shi, N.; Liu, X.; Guan, Y. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 63–67. [Google Scholar]
- Dimitriadou, E.; Weingessel, A.; Hornik, K. Voting-merging: An ensemble method for clustering. In Proceedings of the 2001 International Conference on Artificial Neural Networks, Vienna, Austria, 21–25 August 2001; pp. 217–224. [Google Scholar]
- Fan, J.; Wang, X.; Wu, T.; Zhu, J.; Wang, P. Three-way ensemble clustering based on sample’s perturbation theory. Mathematics 2022, 10, 2598. [Google Scholar] [CrossRef]
- Wang, Y.; Liu, J.; Huang, Y.; Feng, X. Using hashtag graph-based topic model to connect semantically-related words without co-occurrence in microblogs. IEEE Trans. Knowl. Data Eng. 2016, 28, 1919–1933. [Google Scholar] [CrossRef]
- Abdalrada, A.S.; Abawajy, J.; Al-Quraishi, T.; Islam, S.M.S. Machine learning models for prediction of co-occurrence of diabetes and cardiovascular diseases: A retrospective cohort study. J. Diabetes Metab. Disord. 2022, 21, 251–261. [Google Scholar] [CrossRef] [PubMed]
- Bache, K.; Lichman, M. UCI Machine Learning Repository; University of California: Irvine, CA, USA, 2013. [Google Scholar]
- Shi, H.; Wang, P.; Yang, X.; Yu, H. An improved mean imputation clustering algorithm for incomplete data. Neural Process. Lett. 2021, 54, 3537–3550. [Google Scholar] [CrossRef]
- Jiang, W.; Xu, X.; Wen, Z.; Wei, L. Applying the similarity theory to model dust dispersion during coal-mine tunneling. Process Saf. Environ. 2021, 148, 415–427. [Google Scholar] [CrossRef]
- Hoffman, M.; Steinley, D.; Brusco, M.J. A note on using the adjusted rand index for link prediction in networks. Soc. Netw. 2015, 42, 72–79. [Google Scholar] [CrossRef]
- Steinley, D.; Brusco, J.M. A note on the expected value of the Rand index. Brit. J. Math. Stat. Psychol. 2018, 71, 287–299. [Google Scholar] [CrossRef]
- Amodio, S.; D’Ambrosio, A.; Iorio, C.; Siciliano, R. Adjusted concordance index: An extension of the adjusted rand index to fuzzy partitions. J. Classif. 2021, 38, 112–128. [Google Scholar]
Example co-occurrence frequency matrix for six samples (sample labels x1–x6 added for readability; cf. Section 3.2):

| | x1 | x2 | x3 | x4 | x5 | x6 |
|---|---|---|---|---|---|---|
| x1 | 1 | 1 | 0.5 | 0.25 | 0 | 0 |
| x2 | 1 | 1 | 0.5 | 0.25 | 0 | 0 |
| x3 | 0.5 | 0.5 | 1 | 0.75 | 0.25 | 0.25 |
| x4 | 0.25 | 0.25 | 0.75 | 1 | 0.5 | 0.25 |
| x5 | 0 | 0 | 0.25 | 0.5 | 1 | 1 |
| x6 | 0 | 0 | 0.25 | 0.25 | 1 | 1 |
ID | Datasets | Samples | Dimensions | Categories |
---|---|---|---|---|
1 | Seeds | 270 | 7 | 3 |
2 | Credit | 1493 | 9 | 3 |
3 | Ionosphere | 351 | 34 | 2 |
4 | Libras | 360 | 90 | 15 |
5 | Ecoli | 210 | 7 | 3 |
6 | Segmentation | 2310 | 19 | 7 |
7 | Thyroid | 215 | 9 | 3 |
8 | Wdbc | 569 | 30 | 2 |
9 | Wine | 178 | 13 | 3 |
10 | Waveform | 5000 | 40 | 3 |
11 | Iris | 150 | 4 | 3 |
12 | Yeast | 1484 | 8 | 10 |
13 | Dermatology | 366 | 34 | 6 |