Reducing Uncertainty and Increasing Confidence in Unsupervised Learning
Abstract
1. Introduction
1.1. Description of Unsupervised Learning
1.2. Related Works in the Literature
1.3. Existing Challenges in the Literature
- Difficulty in evaluating performance: Unlike in supervised learning, where performance can be evaluated directly against the accuracy of the predicted outputs, evaluating the performance of unsupervised learning algorithms is challenging. There is no clear measure of success or failure, and human interpretation is often required to determine the usefulness of the learned patterns [11].
- Limited applicability: Unsupervised learning (UL) algorithms are typically used for data exploration and pattern discovery, and their applicability is often limited to certain data types. For example, clustering algorithms such as K-means are suitable for data with clear groupings but may not be useful for more complex relationships [12].
- Lack of interpretability: UL algorithms often produce models difficult to interpret, making it challenging to understand the underlying patterns and relationships in the data. This lack of interpretability can limit the usefulness of UL algorithms in certain applications, such as in the medical field, where understanding the reasons behind the predictions is critical [13].
- Vulnerability to noise: UL algorithms can be sensitive to noise in the data, affecting the learned patterns’ accuracy. For example, in clustering algorithms, outliers can distort the boundaries between clusters and affect the accuracy of the clustering results [14].
- Sensitivity to the initial selection of centroids: The K-means algorithm is known to be sensitive to the initial choice of centroids, which can significantly impact the final clustering results. Additionally, K-means assumes that the clusters in the data have isotropic shapes and sizes, which may not hold in many real-world scenarios. These issues can lead to suboptimal solutions, where the algorithm fails to capture the true underlying structure of the data [15] (a small demonstration of this sensitivity is sketched after this list).
- Scalability issues: While effective in capturing complex relationships among data points, hierarchical clustering techniques are computationally intensive. As the size of the data set increases, their computational requirements grow significantly, which hinders the application of hierarchical clustering to large-scale data sets [16].
- Sensitivity to data characteristics: Deep unsupervised techniques, which utilise deep learning architectures for unsupervised learning tasks, can be sensitive to the characteristics of the data set. Factors such as class imbalance, label distribution variations, or differences in data quality can influence the performance of deep unsupervised algorithms. These characteristics need to be carefully considered and addressed to ensure reliable results [17,18].
- Dependency on feature learning: For deep unsupervised techniques to provide reliable results, feature learning plays a crucial role. The algorithms must learn meaningful representations or features from the data to accurately capture the underlying structure. Without effective feature learning, the performance of these algorithms may be compromised, leading to less reliable clustering results [10].
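To make the centroid-initialization sensitivity discussed above concrete, the following minimal Python sketch (our own illustration, not part of the original study) runs plain K-means with a single random initialization under several seeds; scikit-learn and the synthetic make_blobs data are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=0)

# Plain K-means with a single random initialization per seed: different
# seeds can converge to different local optima, giving different inertia
# values and centre positions.
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    centres = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
    print(np.round(centres, 2))
```

Comparing the printed inertia values and centre coordinates across seeds illustrates the run-to-run variability that motivates the repeated-clustering approach proposed later in the paper.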
1.4. Current Work and Contributions
2. Materials and Methods
- K-means++ is an improvement over the classic K-means algorithm that addresses the issue of initial centroid selection. K-means++ initializes the centroids by selecting the first centroid randomly and then selecting each remaining centroid with a probability proportional to its distance from the previously selected centroids [22] (typical library entry points for this and the following variants are sketched after this list).
- Mini-batch K-means is a faster K-means variant that uses random data set subsets to update the centroids at each iteration. The algorithm selects a random subset of the data, assigns the data points to their nearest centroid, and then updates the centroids based on the mean of the assigned data points. The process is repeated until convergence or a defined maximum number of iterations is reached [23].
- Online K-means is a variant of K-means that updates the centroids online, meaning that new data points can be added to the data set at any time. The algorithm starts by initializing the centroids and then updates them as new data points are added to the data set. Online K-means is useful when the data set is too large to fit into memory or when new data continuously arrives [24].
- Kernel K-means is a variant of K-means that uses a kernel function to map the data points into a high-dimensional space. The algorithm then performs K-means clustering in the high-dimensional space, and the resulting clusters are projected back into the original space. Kernel K-means can handle non-linearly separable data. Furthermore, it is useful in scenarios where the data cannot be separated by a linear boundary [25].
- Spectral clustering is a variant of K-means that uses the eigenvectors of the graph Laplacian matrix to partition the data set into K clusters. The algorithm constructs a graph from the data points, where the nodes represent the data points and the edges represent their similarity. Spectral clustering is useful in scenarios where the data points have a complex structure and cannot be separated by a linear boundary [26].
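As referenced in the K-means++ item above, the sketch below shows typical scikit-learn entry points for several of these variants. It is an illustration under our own assumptions (scikit-learn as the library, a two-moons toy data set), not the implementation used in this work; kernel K-means has no direct scikit-learn estimator and is therefore omitted.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.datasets import make_moons

# Toy two-moons data (illustrative only).
X, _ = make_moons(n_samples=600, noise=0.05, random_state=1)

# K-means with K-means++ seeding (the default initialization in scikit-learn).
kmpp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=1).fit(X)

# Mini-batch K-means: centroids updated from random subsets of the data.
mbkm = MiniBatchKMeans(n_clusters=2, batch_size=128, n_init=3, random_state=1).fit(X)

# Online-style updates: feed the data in chunks via partial_fit.
online = MiniBatchKMeans(n_clusters=2, random_state=1)
for chunk in np.array_split(X, 10):
    online.partial_fit(chunk)

# Spectral clustering: builds a similarity graph (here from nearest
# neighbours) and partitions it, handling non-linearly separable data.
spec = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                          n_neighbors=10, random_state=1).fit(X)

print(kmpp.labels_[:10], mbkm.labels_[:10], online.predict(X)[:10], spec.labels_[:10])
```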
3. Proposed Method
- Run the K-means++ clustering algorithm 100 times, requesting a certain number of clusters.
- For each run, calculate the coordinates of the centres of each cluster.
- Compare the coordinates of the centres across all 100 runs to identify the ones that appear most frequently.
- Choose the centres that appear most frequently as the dominant centres for this number of clusters and calculate CDI.
- Repeat the above steps (steps 1–4) nine more times to obtain 10 sets of dominant centres.
- Average the number of appearances of the dominant centres and the respective CDIs, obtained from the ten repetitions (ensuring that they refer to the same clustering centres). Calculate their upper and lower bounds and the variance within these two bounds.
- Repeat steps 1–6 for different numbers of clusters. We start from 3 and run for up to 10 clusters.
- Identify the number of clusters with the highest average CDI for all clustering possibilities (i.e., splitting into 3 and up to 10 clusters) and locate the respective dominant cluster centres. Verify that the variance in the bounds is low (i.e., less than 30%).
- Choose the number of clusters with the highest average CDI and low variance as the optimal number of clusters for the RUN-ICON algorithm (a minimal sketch of this procedure is given below; the full pseudocode appears in Appendix A).
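As referenced in the final step above, the following Python sketch outlines the repeated-clustering procedure; it is our own minimal reading of the listed steps, not the authors' code. In particular, matching centre sets across runs by rounding and sorting, and approximating CDI as the fraction of runs in which the dominant centres appear, are assumptions made for illustration; the paper gives the exact definitions.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def dominant_centres(X, k, runs=100, decimals=1):
    """Run K-means++ `runs` times and return the most frequent centre set.

    Centres are rounded and sorted so that runs converging to the same
    solution produce identical keys (a simplification of the paper's
    centre-comparison step).
    """
    counter = Counter()
    for _ in range(runs):
        km = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(X)
        centres = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
        counter[tuple(map(tuple, np.round(centres, decimals)))] += 1
    key, count = counter.most_common(1)[0]
    return np.array(key), count

def run_icon(X, k_range=range(3, 11), repetitions=10, runs=100, max_spread=0.3):
    results = {}
    for k in k_range:
        # CDI approximated here as the fraction of runs in which the
        # dominant centres appear (assumption; see the paper for the
        # exact definition).
        cdis = np.array([dominant_centres(X, k, runs)[1] / runs
                         for _ in range(repetitions)])
        results[k] = (cdis.mean(), cdis.max() - cdis.min())
    # Highest average CDI among cluster counts whose bounds are narrow.
    candidates = [k for k in results if results[k][1] < max_spread]
    best_k = max(candidates, key=lambda k: results[k][0]) if candidates else None
    return best_k, results
```

The returned best_k corresponds to the highest average CDI with an acceptably narrow spread between the upper and lower bounds, mirroring the final steps of the procedure above.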
4. Application of the Proposed Method
4.1. Benchmarks
- Data set (a) (“2d-10c” in [36]), which contains a significantly higher number of clusters (9) than the average data set, was chosen to assess the algorithm’s scalability and its ability to handle data sets with numerous clusters (a sketch for loading these benchmark files follows this list).
- Data set (e) (“zelnik2” in [36]) contains randomly generated background noise and was chosen to test the efficiency of the algorithm in situations where noise is present.
- Data set (f) (“square2” in [36]) contains shapes with fuzzy boundaries. This data set helps evaluate the algorithm’s efficiency and effectiveness in scenarios with complex and densely distributed clusters.
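As referenced in the first item of this list, the sketch below shows one way to load a two-dimensional benchmark set from the repository in [36]. The ARFF format, the local file path, and the presence of a trailing class column are assumptions made for illustration.

```python
import numpy as np
from scipy.io import arff

# Local copy of a benchmark file from https://github.com/deric/clustering-benchmark
# (assumed to have been downloaded beforehand; the exact path is hypothetical).
data, meta = arff.loadarff("artificial/2d-10c.arff")

# Keep the numeric coordinate columns; the final column is assumed to hold
# the ground-truth class label, which unsupervised clustering does not use.
X = np.column_stack([data[name] for name in meta.names()[:-1]]).astype(float)
print(X.shape)
```

The resulting array X can then be passed to the clustering procedures described in Section 3.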
4.2. Two-Dimensional Data Sets
4.3. Multi-Modal Data Set
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. The Algorithm
Algorithm A1 RUN-ICON

    for k from 3 to 10 do
      for rep in range(10) do
        for run in range(100) do
          initialize centres using K-means++ with k clusters
          calculate the coordinates of the centres of each cluster
          if the centre set is not in the record then add it
          else increment its count
          end if
        end for
        store the dominant (most frequent) centres and the CDI
      end for
      average CDI over the 10 repetitions; compute bounds and variance
      if the average CDI is the highest so far and the variance is below 30% then
        keep k and its dominant centres as the current optimum
      end if
    end for
References
- Hinton, G.E.; Dayan, P.; Frey, B.J.; Neal, R.M. The “Wake-Sleep” Algorithm for Unsupervised Neural Networks. Science 1995, 268, 1158–1161.
- Krotov, D.; Hopfield, J.J. Unsupervised learning by competing hidden units. Proc. Natl. Acad. Sci. USA 2019, 116, 7723–7731.
- Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1735–1742.
- Alloghani, M.; Al-Jumeily Obe, D.; Mustafina, J.; Hussain, A.; Aljaaf, A. A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science. In Supervised and Unsupervised Learning for Data Science; Springer: Cham, Switzerland, 2020; pp. 3–21.
- Na, S.; Xumin, L.; Yong, G. Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; pp. 63–67.
- Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
- Lee, J.; Lee, G. Feature Alignment by Uncertainty and Self-Training for Source-Free Unsupervised Domain Adaptation. Neural Netw. 2023, 161, 682–692.
- Lee, J.; Lee, G. Unsupervised domain adaptation based on the predictive uncertainty of models. Neurocomputing 2023, 520, 183–193.
- Mousavi, Z.; Yousefi Rezaii, T.; Sheykhivand, S.; Farzamnia, A.; Razavi, S. Deep convolutional neural network for classification of sleep stages from single-channel EEG signals. J. Neurosci. Methods 2019, 324, 108312.
- Mousavi, Z.; Varahram, S.; Mohammad Ettefagh, M.; Sadeghi, M.H. Dictionary learning-based damage detection under varying environmental conditions using only vibration responses of numerical model and real intact state: Verification on an experimental offshore jacket model. Mech. Syst. Signal Process. 2023, 182, 109567.
- Orosz, T.; Vagi, R.; Mark, C.; Nagy, D.; Vadasz, P.; Istvan, A.; Megyeri, A. Evaluating Human versus Machine Learning Performance in a LegalTech Problem. Appl. Sci. 2021, 12, 297.
- Melnykov, V.; Michael, S. Clustering Large Datasets by Merging K-Means Solutions. J. Classif. 2019, 37, 97–123.
- Vellido, A. The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Comput. Appl. 2020, 32, 18069–18083.
- Pintelas, E.; Livieris, I.; Pintelas, P. A Convolutional Autoencoder Topology for Classification in High-Dimensional Noisy Image Datasets. Sensors 2021, 21, 7731.
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
- Zou, Z.; Hua, K.; Zhang, X. HGC: Fast hierarchical clustering for large-scale single-cell data. Bioinformatics 2021, 37, 3964–3965.
- Mehra, A.; Kailkhura, B.; Chen, P.Y.; Hamm, J. Understanding the Limits of Unsupervised Domain Adaptation via Data Poisoning. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17347–17359.
- Frank, M.; Drikakis, D.; Charissis, V. Machine-Learning Methods for Computational Science and Engineering. Computation 2020, 8, 15.
- MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 1967; pp. 281–297.
- Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295.
- Har-Peled, S.; Sadri, B. How Fast Is the k-Means Method? Algorithmica 2005, 41, 185–202.
- Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, LA, USA, 7–9 January 2007; Volume 8, pp. 1027–1035.
- Sculley, D. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 1177–1178.
- Cohen-Addad, V.; Guedj, B.; Kanade, V.; Rom, G. Online k-means Clustering. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; Volume 130, pp. 1126–1134.
- Schölkopf, B.; Smola, A.; Müller, K.R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998, 10, 1299–1319.
- Ng, A.; Jordan, M.; Weiss, Y. On Spectral Clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems; Dietterich, T., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2001; Volume 14.
- Shi, C.; Wei, B.; Wei, S.; Wang, W.; Liu, H.; Liu, J. A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 31.
- Fränti, P.; Sieranoja, S. How much can k-means be improved by using better initialization and repeats? Pattern Recognit. 2019, 93, 95–112.
- Kim, E.Y.; Kim, S.Y.; Ashlock, D.; Nam, D. MULTI-K: Accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinform. 2009, 10, 260.
- Shutaywi, M.; Kachouie, N. Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering. Entropy 2021, 23, 759.
- Besta, M.; Kanakagiri, R.; Mustafa, H.; Karasikov, M.; Ratsch, G.; Hoefler, T.; Solomonik, E. Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons. In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, 18–22 May 2020; pp. 1122–1132.
- Manning, C.D.; Raghavan, P.; Schuetze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
- Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. In The Collected Works of Wassily Hoeffding; Fisher, N.I., Sen, P.K., Eds.; Springer: New York, NY, USA, 1994; pp. 409–426.
- Burges, C.J. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
- Ting, D. Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 20–23 August 2018; pp. 2319–2328.
- Barton, T. Clustering Benchmarks. 2023. Available online: https://github.com/deric/clustering-benchmark (accessed on 4 January 2023).
- Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.; Lukasik, S. Seeds Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/seeds (accessed on 21 May 2023).
- Poulinakis, K.; Drikakis, D.; Kokkinakis, I.W.; Spottswood, S.M. Machine-Learning Methods on Noisy and Sparse Data. Mathematics 2023, 11, 236.
Data Set | Predicted Clusters | Average Highest CDI | Uncertainty |
---|---|---|---|
(a) | 9 | 0.62 | 0.30 |
(b) | 3 | 0.83 | 0.10 |
(c) | 5 | 0.70 | 0.23 |
(d) | 4 | 0.85 | 0.11 |
(e) | 3 | 0.95 | 0.07 |
(f) | 4 | 0.81 | 0.08 |
Data Set | PONC | RUN-ICON | K-m++ | Ward | Bayesian K-m | Repeat K-m |
---|---|---|---|---|---|---|
(a) | 9 | 9 | 9 | 9 | 9 | 9 |
(b) | 3 | 3 | 3 | 3 | 3 | 3 |
(c) | 4 | 5 | 4 | 4 | 5 | 4 |
(d) | 4 | 4 | 3 | 3 | 4 | 3 |
(e) | 3 | 3 | 6 or 10 | 10 | 6 | 10 |
(f) | 4 | 4 | 4 | 4 | 4 | 4 |
Error (%) | | 3 | 21 or 43 | 43 | 21 | 43 |
Data Set | Data Points | (s) | (s) | (s) |
---|---|---|---|---|
(a) | 2997 | 18 | 20 | 28 |
(b) | 725 | 18 | 24 | 27 |
(c) | 873 | 18 | 21 | 31 |
(d) | 2014 | 26 | 33 | 42 |
(e) | 314 | 15 | 19 | 21 |
(f) | 1007 | 19 | 25 | 29 |