Efficient High-Dimensional Kernel k-Means++ with Random Projection
Abstract
1. Introduction
Contributions
2. Related Work
3. Our Method
3.1. Kernel k-Means with Random Projection
Algorithm 1: Kernel k-means with random projection.
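The pseudocode of Algorithm 1 does not survive in this extract. As a rough guide to the idea it names, here is a minimal NumPy sketch under stated assumptions: a Gaussian random matrix scaled by 1/√d projects the data before the kernel is formed, an RBF kernel with an illustrative bandwidth `gamma` stands in for whatever kernel the authors use, initialization is plain random (the paper's own initialization is the subject of Section 3.2), and assignments follow the standard kernel k-means distance expansion. The function name and defaults are assumptions, not taken from the paper.

```python
import numpy as np

def kernel_kmeans_rp(X, k, d, gamma=1.0, n_iter=50, seed=0):
    """Kernel k-means on randomly projected data (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    # Project the n x D data down to d dimensions with a Gaussian matrix.
    R = rng.standard_normal((D, d)) / np.sqrt(d)
    Z = X @ R
    # RBF kernel on the projected points (assumed kernel choice).
    sq = np.sum(Z ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T))
    labels = rng.integers(0, k, size=n)  # random init; Section 3.2 refines this
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue  # leave empty clusters at infinite distance
            # ||phi(x_i) - mu_c||^2, dropping the K_ii term that is
            # constant across clusters for a given point:
            dist[:, c] = (-2.0 / m) * K[:, mask].sum(axis=1) \
                         + K[np.ix_(mask, mask)].sum() / m ** 2
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```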
3.2. Efficient Initialization
Algorithm 2: Proposed function to generate the random matrix, where N(0,1) generates a D × d matrix of random numbers drawn from the standard normal distribution (mean 0, variance 1).
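Only the caption of Algorithm 2 survives here. A minimal sketch of such a generator follows; the function name and the 1/√d scaling (a standard Johnson-Lindenstrauss-style convention so that squared Euclidean distances are preserved in expectation) are assumptions.

```python
import numpy as np

def generate_random_matrix(D, d, seed=None):
    """Draw a D x d matrix of i.i.d. N(0, 1) entries, scaled by 1/sqrt(d)
    (assumed convention) so projected squared distances are unbiased."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((D, d)) / np.sqrt(d)

# Usage: project SMK-187-sized data (D = 19,993) down to d = 200 dimensions.
# Z = X @ generate_random_matrix(19993, 200, seed=42)
```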
3.2.1. k-Means++ with Fixed Random Projection (kmFRP)
3.2.2. k-Means++ with Iterative Random Projection (kmIRP)
3.2.3. k-Means++ with Buffered Random Projection (kmBRP)
Algorithm 3: Proposed k-means++ approximation with different types of random projection.
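Algorithm 3's pseudocode is likewise missing from this extract. The sketch below shows one plausible way to combine k-means++ D²-sampling with the three projection schemes of Sections 3.2.1-3.2.3; the mode names, the `buffer_size` parameter, and the exact points at which distances are recomputed are illustrative assumptions, so the authors' variants may differ in detail.

```python
import numpy as np

def _random_matrix(rng, D, d):
    return rng.standard_normal((D, d)) / np.sqrt(d)

def kmeanspp_rp(X, k, d, mode="fixed", buffer_size=5, seed=0):
    """k-means++ (D^2) seeding in a randomly projected space (sketch):
      'fixed'     (kmFRP) one projection reused for every seeding round,
      'iterative' (kmIRP) a fresh projection drawn at each round,
      'buffered'  (kmBRP) a small pre-drawn pool of projections, cycled."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    n_mats = buffer_size if mode == "buffered" else 1
    pool = [_random_matrix(rng, D, d) for _ in range(n_mats)]
    Z = X @ pool[0]
    centers = [int(rng.integers(n))]  # first center chosen uniformly
    for t in range(1, k):
        if mode == "iterative":
            Z = X @ _random_matrix(rng, D, d)
        elif mode == "buffered":
            Z = X @ pool[t % len(pool)]
        # Squared distance from every point to its nearest chosen center,
        # measured in the current projected space.
        diffs = Z[:, None, :] - Z[centers][None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        centers.append(int(rng.choice(n, p=d2 / d2.sum())))  # D^2 sampling
    return np.asarray(centers)
```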
3.3. Complexity
4. Experiments
4.1. Datasets
4.2. Environments
4.3. Performance Metrics
4.4. Experiments on the Efficient Centroid Initialization
4.5. Experiments on the Kernel k-Means Clustering
4.6. Findings
4.6.1. Performance of Our Method for Centroid Initialization
4.6.2. Clustering Performance of Our Method
5. Discussions
5.1. Optimal Dimensionality of the Projected Space
5.2. Ablation Study on Our Method
5.3. Reliability, Stability and Robustness
5.3.1. Reliability
5.3.2. Robustness
5.3.3. Stability
6. Conclusions
6.1. Managerial and Academic Implications
6.2. Limitations
6.3. Future Studies and Recommendations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Dataset | d | k = 8 | k = 10 | k = 20 | k = 50 |
|---|---|---|---|---|---|
| Reuters-21578 (D = 2003) | 200 | 2.08% | 1.81% | 0.60% | 0.42% |
| | 300 | 1.85% | 0.86% | 1.37% | 0.47% |
| | 401 | 0.12% | 0.40% | 0.53% | 0.02% |
| GLI-85 (D = 22,283) | 2228 | 0.50% | 0.73% | 1.06% | 0.19% |
| | 3342 | 1.53% | 2.59% | 0.93% | 0.81% |
| | 4457 | 2.03% | 1.10% | 0.20% | 3.94% |
| SMK-187 (D = 19,993) | 1999 | 1.73% | 0.40% | 0.16% | 1.40% |
| | 2999 | 3.74% | 1.98% | 0.02% | 0.01% |
| | 3999 | 3.33% | 2.62% | 0.62% | 0.58% |
| Dataset | d | k = 8 | k = 10 | k = 20 | k = 50 |
|---|---|---|---|---|---|
| Reuters-21578 (D = 2003) | baseline | 20.48 | 33.47 | 141.22 | 920.74 |
| | 200 | 8.61 | 11.49 | 26.28 | 135.16 |
| | 300 | 10.65 | 14.39 | 32.59 | 132.22 |
| | 401 | 13.03 | 17.89 | 41.63 | 175.59 |
| GLI-85 (D = 22,283) | baseline | 3.77 | 5.93 | 22.44 | 142.02 |
| | 2228 | 4.77 | 5.33 | 8.68 | 33.63 |
| | 3342 | 6.70 | 7.24 | 11.21 | 37.07 |
| | 4457 | 8.85 | 9.30 | 13.34 | 40.89 |
| SMK-187 (D = 19,993) | baseline | 5.96 | 8.88 | 32.27 | 194.97 |
| | 1999 | 5.92 | 6.35 | 9.86 | 31.76 |
| | 2999 | 8.48 | 8.97 | 12.97 | 38.41 |
| | 3999 | 10.99 | 11.56 | 16.07 | 44.94 |
| Dataset | D | d | Proposed ¹ | Baseline ² | Speed-up |
|---|---|---|---|---|---|
| MNIST | 784 | 100 | 0.42 | 2.22 | 5.4× |
| ImageNet-Cat | 1000 | 200 | 19.80 | 81.82 | 4.13× |
| SMK-187 | 19,993 | 100 | 0.53 | 13.80 | 26.1× |
| GLI-85 | 22,283 | 100 | 0.36 | 3.24 | 9.10× |
| Reuters | 929 | 100 | 110.2 | 603.3 | 5.47× |
| 20 Newsgroups | 947 | 200 | 27.8 | 65.9 | 2.37× |
| Dataset | d ¹ | % Change in WCSS ² | Stdev/Mean, Kernel k-Means++ w/ RP ³ | Stdev/Mean, Vanilla Kernel k-Means ⁴ |
|---|---|---|---|---|
| MNIST | 100 | −1.18% | 1.83% | |
| | 200 | −1.47% | 1.81% | 1.60% |
| | 500 | −1.40% | 1.70% | |
| ImageNet | 100 | −2.14% | 4.78% | |
| | 200 | −2.84% | 5.01% | 4.32% |
| | 500 | −2.59% | 3.99% | |
| 20 Newsgroups | 100 | 0.41% | 0.53% | |
| | 200 | 0.04% | 0.69% | 0.50% |
| | 500 | −0.21% | 0.75% | |
| Reuters | 200 | 0.16% | 0.40% | |
| | 300 | 0.20% | 0.38% | 0.23% |
| | 400 | 0.27% | 0.81% | |
| GLI-85 | 50 | 0.60% | 1.55% | |
| | 100 | 0.11% | 1.53% | 1.41% |
| | 200 | 1.12% | 1.48% | |
| SMK-187 | 50 | −0.02% | 2.13% | |
| | 100 | 0.37% | 1.96% | 1.35% |
| | 200 | −0.67% | 2.16% | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Chan, J.Y.K.; Leung, A.P.; Xie, Y. Efficient High-Dimensional Kernel k-Means++ with Random Projection. Appl. Sci. 2021, 11, 6963. https://doi.org/10.3390/app11156963