An Improved K-Means Algorithm Based on Contour Similarity
Abstract
1. Introduction
- (1) Clustering results are susceptible to noisy data and outliers;
- (2) Random selection of the initial centers tends to drive the clustering results toward local optima;
- (3) In realistic application scenarios, the number of clusters is difficult to determine in advance;
- (4) The reliance on Euclidean distance in the classic K-means algorithm limits its ability to discern non-linear or irregularly shaped clusters.
2. Literature Survey
3. Pan-Factor Space Theory and Contour Similarity
- (1) The coordinates of an object on each factor are natural numbers; a coordinate of zero indicates that no actual statistical value is available, i.e., a missing value.
- (2) For an ordinal factor, the coordinate values of two objects indicate only their order of precedence and carry no quantitative meaning. The numerical difference between two coordinates is in general meaningless, but its sign reflects the relative potential of the two objects on that factor.
- (3) Coordinates taken on different factors are not directly comparable.
- The comparison between two coordinates on the same factor is then defined case by case.
- (1) Boundedness: the contour similarity takes values in a bounded interval;
- (2) Regularity: the similarity of any sample with itself attains the maximum value;
- (3) Symmetry: the similarity between two samples does not depend on the order in which they are compared;
- (4) Rank preservation: if one pair of samples is closer in contour distance than another pair, the corresponding contour similarities preserve the reverse order. (The standard forms of these axioms are written out after this list.)
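For reference, the four axioms can be stated in a generic form, with $S(\cdot,\cdot)$ denoting the contour similarity and $d(\cdot,\cdot)$ the associated contour distance. The notation below is assumed for this outline and may differ from the paper's own symbols:

$$
\begin{aligned}
&\text{Boundedness:} && 0 \le S(x_i, x_j) \le 1;\\
&\text{Regularity:} && S(x_i, x_i) = 1;\\
&\text{Symmetry:} && S(x_i, x_j) = S(x_j, x_i);\\
&\text{Rank preservation:} && d(x_i, x_j) \le d(x_i, x_k) \;\Rightarrow\; S(x_i, x_j) \ge S(x_i, x_k).
\end{aligned}
$$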
4. Algorithmic Ideas and Improvements
4.1. The Lattice Transformation Process
- (1) For the original data matrix, record the number of rows (samples) and the number of columns (factors).
- (2) Find the reference point vector.
- (3) For each element of the data matrix, apply the lattice transformation with respect to the reference point vector, yielding the lattice-coded data. (A minimal illustrative sketch follows this list.)
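The paper's exact lattice transformation formulas are not reproduced in this outline, so the following is only a minimal sketch under stated assumptions: the reference point is taken as the per-column minimum, and each value is shifted and rounded onto a non-negative integer grid. The names `lattice_transform` and `reference_point` are illustrative, not taken from the paper.

```python
import numpy as np

def lattice_transform(X: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Illustrative lattice transformation (assumed form, not the paper's formula).

    Assumes the reference point is the per-column minimum and that values are
    shifted and rounded onto a non-negative integer grid.
    """
    m, n = X.shape                      # step (1): number of samples and factors
    reference_point = X.min(axis=0)     # step (2): one reference value per factor (assumption)
    L = np.rint(X - reference_point).astype(int)   # step (3): shift and round onto the lattice
    return L, reference_point

# Example usage on a toy 3x2 data matrix
X = np.array([[2.0, 10.0],
              [3.5, 12.0],
              [5.0, 11.0]])
L, ref = lattice_transform(X)
print(L)    # lattice-coded data
print(ref)  # reference point vector
```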
4.2. The Similarity Calculation Process of Contour Similarity
- (1) Calculate the scale vector.
- (2) Generate the contour data matrix.
- (3) Measure the potential of each sample.
- (4) Calculate the pose metric for each sample.
- (5) Calculate the factor contour distance matrix for the samples.
- (6) Calculate the contour similarity matrix between individual samples. (A schematic sketch of this pipeline follows the list.)
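The six steps above map lattice-coded data to a pairwise contour similarity matrix. Since the paper's formulas for the scale vector, potential, pose metric, and contour distance are not reproduced in this outline, the sketch below only mirrors the structure of the pipeline; every concrete formula in it (column maxima as the scale, row means as the potential, and so on) is a placeholder assumption, not the paper's definition.

```python
import numpy as np

def contour_similarity_matrix(L: np.ndarray) -> np.ndarray:
    """Schematic pipeline only: each step uses a simple placeholder formula,
    not the paper's definitions of scale, potential, pose, or contour distance."""
    # (1) scale vector: per-factor maximum of the lattice-coded data (placeholder choice)
    scale = L.max(axis=0)
    scale = np.where(scale == 0, 1, scale)            # guard against division by zero
    # (2) contour data matrix: each factor rescaled to a common range (placeholder)
    C = L / scale
    # (3) potential of each sample: overall level of its contour (placeholder: row mean)
    potential = C.mean(axis=1)
    # (4) pose metric: shape of the contour with the level removed (placeholder)
    pose = C - potential[:, None]
    # (5) factor contour distance matrix: pairwise distance between poses (placeholder)
    D = np.abs(pose[:, None, :] - pose[None, :, :]).mean(axis=2)
    # (6) contour similarity matrix: map distance to a similarity in (0, 1] (placeholder)
    S = 1.0 / (1.0 + D)
    return S

# Example usage with a small lattice-coded matrix
L = np.array([[0, 0], [2, 2], [3, 1]])
print(contour_similarity_matrix(L))
```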
4.3. Design of k-Means Algorithm Based on Contour Similarity
- (1) Randomly select k samples from the data set as the initial centroids;
- (2) Calculate the distance from each sample to the k centroids using the Euclidean distance;
- (3) Assign each sample to its nearest centroid, forming k classes;
- (4) Update the centroids of the k classes;
- (5) Repeat steps 2–4 until the maximum number of iterations is reached or the centroids no longer change between two consecutive iterations;
- (6) Output the k classes of the partition. (A minimal sketch of this baseline loop is given after the list.)
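For reference, the baseline loop of steps (1)–(6) can be written compactly in NumPy. This is a generic k-means sketch; the function name `kmeans_euclidean` and the convergence test are illustrative choices, not code from the paper.

```python
import numpy as np

def kmeans_euclidean(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Baseline k-means: Euclidean assignment (argmin) and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step (1): random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # steps (2)-(3): Euclidean distance to each centroid, assign to the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step (4): recompute each centroid as the mean of its class
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step (5): stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers   # step (6): the k classes of the partition
```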
Algorithm 1: The CSk-means.
Input: D = {x1, x2, …, xn} (xi represents a vector of sample data), the value of k, the maximum number of iterations.
Output: The k classes that have been classified.
1. Select k sample points as the initial clustering centers.
2. repeat
3. for each sample xi in D do
4. Calculate the contour similarity of xi to each class center.
5. Decide the class label of xi based on the maximum contour similarity.
6. Assign xi to the corresponding cluster.
7. end for
8. Update the clustering centers.
9. until the maximum number of iterations is reached or the clustering centers no longer change.
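Structurally, Algorithm 1 differs from the baseline loop above only in the assignment step: each sample is assigned to the class center with the maximum contour similarity rather than the minimum Euclidean distance. The sketch below leaves the similarity as a pluggable argument because the paper's contour similarity formula is not reproduced in this outline; `csk_means` and `similarity` are illustrative names, not the reference implementation.

```python
import numpy as np

def csk_means(X: np.ndarray, k: int, similarity, max_iter: int = 100, seed: int = 0):
    """Skeleton of Algorithm 1: assign by maximum similarity, then update centers.

    `similarity(A, B)` must return an |A| x |B| matrix of pairwise similarities;
    any concrete contour similarity can be plugged in here."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # line 1: initial clustering centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):                                # lines 2-9: main loop
        S = similarity(X, centers)                           # line 4: similarity to each class center
        labels = S.argmax(axis=1)                            # lines 5-6: label by maximum similarity
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])   # line 8: update centers
        if np.allclose(new_centers, centers):                # line 9: stop if centers unchanged
            break
        centers = new_centers
    return labels, centers
```

To connect this skeleton with the placeholder pipeline sketched in Section 4.2, one possible adapter is `lambda A, B: contour_similarity_matrix(np.vstack([A, B]))[:len(A), len(A):]`, which extracts the sample-to-center block of the full similarity matrix.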
5. Experimental Validation of the CSk-means Algorithm
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Pérez-Ortega, J.; Almanza-Ortega, N.N.; Vega-Villalobos, A.; Pazos-Rangel, R.; Zavala-Díaz, C.; Martínez-Rebollar, A. The K-means algorithm evolution. Introd. Data Sci. Mach. Learn. 2019, 69–90.
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
- MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Volume 1, No. 14; pp. 281–297.
- Jancey, R.C. Multidimensional group analysis. Aust. J. Bot. 1966, 14, 127–130.
- Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801.
- Kapoor, A.; Singhal, A. A comparative study of K-Means, K-Means++ and Fuzzy C-Means clustering algorithms. In Proceedings of the 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 9–10 February 2017; pp. 1–6.
- Ezugwu, A.E.-S.; Agbaje, M.B.; Aljojo, N.; Els, R.; Chiroma, H.; Elaziz, M.A. A Comparative Performance Study of Hybrid Firefly Algorithms for Automatic Data Clustering. IEEE Access 2020, 8, 121089–121118.
- Annas, M.; Wahab, S.N. Data Mining Methods: K-Means Clustering Algorithms. Int. J. Cyber IT Serv. Manag. 2023, 3, 40–47.
- Hu, H.; Liu, J.; Zhang, X.; Fang, M. An Effective and Adaptable K-means Algorithm for Big Data Cluster Analysis. Pattern Recognit. 2023, 139, 109404.
- Mussabayev, R.; Mladenovic, N.; Jarboui, B.; Mussabayev, R. How to use K-means for big data clustering? Pattern Recognit. 2023, 137, 109269.
- Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 3rd ed.; Academic Press: Cambridge, MA, USA, 2006.
- Guedes, P.C.; Müller, F.M.; Righi, M.B. Risk measures-based cluster methods for finance. Risk Manag. 2023, 25, 4.
- Yudhistira, A.; Andika, R. Pengelompokan Data Nilai Siswa Menggunakan Metode K-Means Clustering. J. Artif. Intell. Technol. Inf. 2023, 1, 20–28.
- Navarro, M.M.; Young, M.N.; Prasetyo, Y.T.; Taylar, J.V. Stock market optimization amidst the COVID-19 pandemic: Technical analysis, K-means algorithm, and mean-variance model (TAKMV) approach. Heliyon 2023, 9, 2–3.
- Foster, J.; Gray, R.; Dunham, M. Finite-state vector quantization for waveform coding. IEEE Trans. Inf. Theory 1985, 31, 348–359.
- Liaw, Y.-C.; Lo, W.; Lai, J.Z. Image restoration of compressed image using classified vector quantization. Pattern Recognit. 2002, 35, 329–340.
- Zhu, A.; Hua, Z.; Shi, Y.; Tang, Y.; Miao, L. An improved K-means algorithm based on evidence distance. Entropy 2021, 23, 1550.
- Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210.
- Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666.
- Li, Y.; Wu, H. A clustering method based on K-means algorithm. Phys. Procedia 2012, 25, 1104–1109.
- Singh, A.; Yadav, A.; Rana, A. K-means with Three different Distance Metrics. Int. J. Comput. Appl. 2013, 67, 1–3.
- Chakraborty, A.; Faujdar, N.; Punhani, A.; Saraswat, S. Comparative Study of K-Means Clustering Using Iris Data Set for Various Distances. In Proceedings of the 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 29–31 January 2020.
- Sebayang, F.A.; Lydia, M.S.; Nasution, B.B. Optimization on Purity K-Means Using Variant Distance Measure. In Proceedings of the 2020 3rd International Conference on Mechanical, Electronics, Computer, and Industrial Technology (MECnIT), Medan, Indonesia, 25–27 June 2020; pp. 143–147.
- Tang, Z.K.; Zhu, Z.Y.; Yang, Y.; Caihong, L.; Lian, L. DK-means algorithm based on distance and density. Appl. Res. Comput. 2020, 37, 1719–1723.
- Wang, Z.L.; Li, J.; Song, Y.F. Improved K-means algorithm based on distance and weight. Comput. Eng. Appl. 2020, 56, 87–94.
- Wang, Y.; Luo, X.; Zhang, J.; Zhao, Z.; Zhang, J. An Improved Algorithm of K-means Based on Evolutionary Computation. Intell. Autom. Soft Comput. 2020, 26, 961–971.
- Zhang, Y.; Zhang, D.; Shi, H. K-means clustering based on self-adaptive weight. In Proceedings of the 2012 2nd International Conference on Computer Science and Network Technology, Changchun, China, 29–31 December 2012; IEEE: Piscataway, NJ, USA, 2012.
- Chen, A.; Yang, Y. Diffusion K-means clustering on manifolds: Provable exact recovery via semidefinite relaxations. Appl. Comput. Harmon. Anal. 2021, 52, 303–347.
- Dinh, D.-T.; Huynh, V.-N.; Sriboonchitta, S. Clustering mixed numerical and categorical data with missing values. Inf. Sci. 2021, 571, 418–442.
- Bao, Y. Contour similarity and metric of samples of finite dimensional state vector. J. Liaoning Tech. Univ. 2011, 30, 603–660.
- Zhao, F.; Sun, M.; Bao, Y. Similarity Measure of Geometric Contours about Multi-Scale Data and Its Application. Math. Pract. Theory 2013, 43, 178–182.
- Fisher, R.A. Iris. UCI Machine Learning Repository. 1988. Available online: https://archive.ics.uci.edu/dataset/53/iris (accessed on 21 April 2024).
- Aeberhard, S.; Forina, M. Wine. UCI Machine Learning Repository. 1991. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 21 April 2024).
- Nakai, K. Ecoli. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/39/ecoli (accessed on 21 April 2024).
- Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.; Lukasik, S. Seeds. UCI Machine Learning Repository. 2012. Available online: https://archive.ics.uci.edu/dataset/236/seeds (accessed on 21 April 2024).
- Nakai, K. Yeast. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/110/yeast (accessed on 21 April 2024).
- Nash, W.; Sellers, T.; Talbot, S.; Cawthorn, A.; Ford, W. Abalone. UCI Machine Learning Repository. 1995. Available online: https://archive.ics.uci.edu/dataset/1/abalone (accessed on 21 April 2024).
Datasets | No. of Samples | No. of Features | No. of Clusters |
---|---|---|---|
Iris [32] | 150 | 4 | 3 |
Wine [33] | 178 | 13 | 3 |
Ecoli [34] | 336 | 8 | 8 |
Seeds [35] | 420 | 7 | 3 |
Yeast [36] | 1484 | 8 | 10 |
Abalone [37] | 4177 | 8 | 28 |
Datasets | Time Spent (CSk-Means) | Time Spent (k-Means) | Iterations (CSk-Means) | Iterations (k-Means) | Accuracy (CSk-Means) | Accuracy (k-Means) |
---|---|---|---|---|---|---|
Iris | 0.0076 | 0.0064 | 3 | 4 | 0.92 | 0.89 |
Wine | 0.0754 | 0.0182 | 7 | 5 | 0.94 | 0.70 |
Ecoli | 0.0822 | 0.0418 | 10 | 23 | 0.76 | 0.58 |
Seeds | 0.4331 | 0.2781 | 6 | 7 | 0.90 | 0.89 |
Yeast | 0.1946 | 0.0358 | 10 | 40 | 0.42 | 0.40 |
Abalone | 0.2286 | 0.2092 | 30 | 54 | 0.18 | 0.22 |
Datasets | F (CSk-Means) | F (k-Means) | P (CSk-Means) | P (k-Means) | R (CSk-Means) | R (k-Means) |
---|---|---|---|---|---|---|
Iris | 0.94 | 0.92 | 0.94 | 0.88 | 0.94 | 0.98 |
Wine | 0.94 | 0.78 | 0.95 | 0.83 | 0.97 | 0.74 |
Ecoli | 0.76 | 0.57 | 0.76 | 0.90 | 0.76 | 0.42 |
Seeds | 0.91 | 0.91 | 0.92 | 0.91 | 0.91 | 0.92 |
Yeast | 0.40 | 0.35 | 0.35 | 0.34 | 0.50 | 0.36 |
Abalone | 0.30 | 0.24 | 0.33 | 0.29 | 0.28 | 0.21 |
Datasets | CSk-Means Accuracy | CSk-Means F | CSk-Means P | CSk-Means R | Hierarchical Clustering Accuracy | Hierarchical Clustering F | Hierarchical Clustering P | Hierarchical Clustering R | Spectral Clustering Accuracy | Spectral Clustering F | Spectral Clustering P | Spectral Clustering R | k-Means++ Accuracy | k-Means++ F | k-Means++ P | k-Means++ R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Iris | 0.92 | 0.94 | 0.94 | 0.94 | 0.35 | 0.51 | 0.34 | 1.00 | 0.91 | 0.94 | 0.88 | 1.00 | 0.91 | 0.92 | 0.88 | 0.98 |
Wine | 0.94 | 0.96 | 0.95 | 0.97 | 0.43 | 0.60 | 0.61 | 0.58 | 0.32 | 0.49 | 0.33 | 0.93 | 0.74 | 0.80 | 0.86 | 0.76 |
Ecoli | 0.76 | 0.76 | 0.76 | 0.76 | 0.45 | 0.61 | 0.44 | 0.98 | 0.56 | 0.55 | 0.85 | 0.41 | 0.60 | 0.57 | 0.91 | 0.43 |
Seeds | 0.90 | 0.91 | 0.92 | 0.91 | 0.92 | 0.94 | 0.93 | 0.95 | 0.90 | 0.92 | 0.91 | 0.94 | 0.91 | 0.93 | 0.93 | 0.94 |
Yeast | 0.42 | 0.40 | 0.35 | 0.50 | 0.32 | 0.48 | 0.31 | 0.97 | 0.29 | 0.01 | 0.05 | 0.00 | 0.41 | 0.36 | 0.35 | 0.36 |
Abalone | 0.18 | 0.30 | 0.33 | 0.28 | 0.22 | 0.33 | 0.29 | 0.36 | 0.13 | 0.22 | 0.31 | 0.17 | 0.23 | 0.25 | 0.30 | 0.21 |