Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data
Abstract
1. Introduction
- We propose three new data quality measures that can be computed directly from a given dataset and compared across datasets with different numbers of classes, examples, and features. Although our approach draws on ideas from LDA (linear discriminant analysis) and PCA (principal component analysis), those techniques by themselves do not produce single numbers that are comparable across datasets.
- We provide efficient algorithms to approximate the suggested data quality measures, making them practical for large-scale, high-dimensional data.
- The proposed class separability measure is strongly correlated with the actual classification performance of linear and non-linear classifiers in our experiments.
- The proposed in-class variability measures quantify the diversity of data within each class and can be used to analyze redundancy or outlier issues (see the sketch after this list).
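To make the construction concrete, the sketch below computes a Fisher-style class separability score and a per-class variability score from the usual LDA scatter matrices. It is a minimal illustration under our own assumptions — the function names, the trace-based normalization, and the centroid-distance variability are ours, not the paper's exact definitions.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class and between-class scatter matrices, as in Fisher's LDA."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)
    return S_w, S_b

def class_separability(X, y, eps=1e-8):
    """Illustrative separability score: ratio of between- to within-class
    scatter, reduced to a single number via traces so that it can be
    compared across datasets of different sizes."""
    S_w, S_b = scatter_matrices(X, y)
    return np.trace(S_b) / (np.trace(S_w) + eps)

def in_class_variability(X, y):
    """Illustrative per-class variability: mean squared distance of each
    example to its class centroid."""
    return {c: ((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=1).mean()
            for c in np.unique(y)}
```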
2. Related Work
2.1. Descriptor-Based Approaches
2.2. Graph-Based Approaches
2.3. Classifier-Based Approaches
3. Methods
3.1. Fisher’s LDA
3.2. Proposed Data Quality Measures
3.2.1. Class Separability
3.2.2. In-Class Variability
3.3. Methods for Efficient Computation
3.3.1. Random Projection
3.3.2. Bootstrapping
Algorithm 1: Computation of the class separability and in-class variability measures.
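Algorithm 1 itself is not reproduced on this page; the sketch below shows, under our own assumptions, how the two techniques named above could be combined: a Gaussian random projection reduces the dimension (in the spirit of the Johnson–Lindenstrauss lemma), and bootstrapping averages the score over small resamples. It reuses the hypothetical `class_separability` helper from the earlier sketch.

```python
import numpy as np

def approx_separability(X, y, k=128, n_boot=20, boot_frac=0.1, seed=0):
    """Approximate the separability score by (1) projecting the data onto a
    random k-dimensional subspace and (2) averaging the score over
    bootstrap resamples of the projected data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Gaussian random projection; the 1/sqrt(k) scaling preserves
    # expected squared norms.
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    X_proj = X @ R
    m = max(1, int(boot_frac * n))
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)  # bootstrap resample
        scores.append(class_separability(X_proj[idx], y[idx]))
    return float(np.mean(scores)), float(np.std(scores))
```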
4. Experiment Results
4.1. Datasets
4.2. Representation Performance of the Class Separability Measure
4.2.1. Correlation with Classifier Accuracy
4.2.2. Correlation with Other Quality Measures
4.2.3. Computation Time
4.2.4. Comparison to Exact Computation
4.3. Class-Wise In-Class Variability Measure
4.4. Quality Ranking Using the Proposed Measures
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ho, T.K.; Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 289–300.
- Baumgartner, R.; Somorjai, R. Data complexity assessment in undersampled classification of high-dimensional biomedical data. Pattern Recognit. Lett. 2006, 27, 1383–1389.
- Branchaud-Charron, F.; Achkar, A.; Jodoin, P.M. Spectral metric for dataset complexity assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019.
- Hancock, J.M.; Zvelebil, M.J.; Cristianini, N. Fisher Discriminant Analysis (Linear Discriminant Analysis). Dict. Bioinform. Comput. Biol. Available online: https://doi.org/10.1002/9780471650126.dob0238.pub2 (accessed on 25 October 2020).
- Bingham, E.; Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001.
- Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26.
- Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications; Springer: Berlin, Germany, 2008.
- Leyva, E.; González, A.; Pérez, R. A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans. Knowl. Data Eng. 2015, 27, 354–367.
- Sotoca, J.M.; Mollineda, R.A.; Sánchez, J.S. A meta-learning framework for pattern classification by means of data complexity measures. Intel. Artif. 2006, 10, 31–38.
- Garcia, L.P.F.; Lorena, A.C.; de Souto, M.C.P.; Ho, T.K. Classifier recommendation using data complexity measures. In Proceedings of the International Conference on Pattern Recognition, Beijing, China, 20–24 August 2018.
- de Melo, V.V.; Lorena, A.C. Using complexity measures to evolve synthetic classification datasets. In Proceedings of the IEEE International Joint Conference on Neural Networks, Rio de Janeiro, Brazil, 8–13 July 2018.
- Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Proceedings of the International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011.
- van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Duin, R.P.W.; Pękalska, E. Object representation, sample size and dataset complexity. In Data Complexity in Pattern Recognition; Springer: Berlin, Germany, 2006.
- Li, C.; Farkhoor, H.; Liu, R.; Yosinski, J. Measuring the intrinsic dimension of objective landscapes. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Golub, G.H.; Van Loan, C.F. Matrix Computations; Johns Hopkins University Press: Baltimore, MD, USA, 1996; pp. 470–507.
- Dasgupta, S.; Gupta, A. An Elementary Proof of the Johnson–Lindenstrauss Lemma; Technical Report; UC Berkeley: Berkeley, CA, USA, 1999.
- Kaski, S. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proceedings of the IEEE International Joint Conference on Neural Networks, Anchorage, AK, USA, 4–9 May 1998.
- LeCun, Y.; Cortes, C. MNIST Handwritten Digit Database. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 25 October 2020).
- Bulatov, Y. notMNIST Dataset. Technical Report, Google (Books/OCR), 2011. Available online: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html (accessed on 25 October 2020).
- Chaladze, G.; Kalatozishvili, L. Linnaeus 5 Dataset for Machine Learning. Available online: http://chaladze.com/l5/ (accessed on 25 October 2020).
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; Computer Science Department, University of Toronto: Toronto, ON, Canada, 2009.
- Coates, A.; Lee, H.; Ng, A.Y. An analysis of single-layer networks in unsupervised feature learning. J. Mach. Learn. Res. 2011, 15, 215–223.
- Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 16–17 December 2011.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
- Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 1999, 11, 169–198.
Dataset | Accuracy | No. Examples (M) | No. Classes | Description
---|---|---|---|---
MNIST | 92.64% | 12.5 k | 10 | Handwritten digits
notMNIST | 89.24% | 12.5 k | 10 | Fonts and glyphs similar to MNIST
Linnaeus | 45.50% | 4.8 k | 5 | Plant and animal images
CIFAR10 | 42.84% | 12.5 k | 10 | Object recognition images
STL10 | 40.88% | 12.5 k | 10 | Object recognition images
SVHN | 45.60% | 12.5 k | 10 | House number images
ImageNet-1 | 37.40% | 5 k | 10 | Visual recognition images (Tiny ImageNet)
ImageNet-2 | 40.60% | 5 k | 10 | Visual recognition images (Tiny ImageNet)
ImageNet-3 | 34.20% | 5 k | 10 | Visual recognition images (Tiny ImageNet)
ImageNet-4 | 36.20% | 5 k | 10 | Visual recognition images (Tiny ImageNet)
Classifier | Quality Measure | Pearson Corr. | Spearman Corr. | Time (s)
---|---|---|---|---
Perceptron | | 0.9386 | 0.7697 | 1253
Perceptron | | 0.9889 | 0.7333 | 5104
Perceptron | | 0.9858 | 0.7333 | 9858
Perceptron | | 0.9452 | 0.8182 | 23,711
Perceptron | (ours) | 0.9693 | 0.8061 | 354
MLP-1 | | 0.9039 | 0.3455 | 1253
MLP-1 | | 0.9959 | 0.9758 | 5104
MLP-1 | | 0.9961 | 0.9030 | 9858
MLP-1 | | 0.9295 | 0.5879 | 23,711
MLP-1 | (ours) | 0.9261 | 0.3818 | 354
MLP-2 | | 0.8855 | 0.3455 | 1253
MLP-2 | | 0.9908 | 0.9273 | 5104
MLP-2 | | 0.9912 | 0.8788 | 9858
MLP-2 | | 0.9127 | 0.5879 | 23,711
MLP-2 | (ours) | 0.9117 | 0.4303 | 354
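For reference, correlations of the kind reported above are standard library calls; the sketch below uses made-up placeholder scores and accuracies purely for illustration — they are not the paper's numbers.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical placeholders: one separability score and one test accuracy
# per dataset (NOT the paper's data).
separability = [0.22, 0.31, 0.58, 0.64, 0.71]
accuracy     = [0.41, 0.45, 0.78, 0.85, 0.90]

pearson_r, _ = pearsonr(separability, accuracy)    # linear correlation
spearman_r, _ = spearmanr(separability, accuracy)  # rank correlation
print(f"Pearson: {pearson_r:.4f}, Spearman: {spearman_r:.4f}")
```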
 | (Ours) | | | |
---|---|---|---|---|---
(ours) | 1.0000 | 0.9673 | 0.9322 | 0.9245 | 0.9199
 | 0.9673 | 1.0000 | 0.8909 | 0.8879 | 0.8806
 | 0.9322 | 0.8909 | 1.0000 | 0.9988 | 0.9400
 | 0.9245 | 0.8879 | 0.9988 | 1.0000 | 0.9417
 | 0.9199 | 0.8806 | 0.9400 | 0.9417 | 1.0000
Image Size | Dimension | (Ours) | | | | Speedup
---|---|---|---|---|---|---
16 × 16 × 3 | 768 | 49 | 246 | 445 | 1963 | 17
32 × 32 × 3 | 3072 | 203 | 947 | 1850 | 2513 | 57
48 × 48 × 3 | 6912 | 492 | 2144 | 4182 | 3132 | 122
64 × 64 × 3 | 12,288 | 1011 | 3810 | 7466 | 5427 | 303
80 × 80 × 3 | 19,200 | 1783 | 5973 | 11,674 | 6838 | 461
96 × 96 × 3 | 27,648 | 2850 | 8652 | 17,346 | 9550 | 700
Sample Size | (Ours) | | | | Speedup
---|---|---|---|---|---
10,000 | 205 | 939 | 1904 | 2645 | 56
20,000 | 411 | 3719 | 7797 | 5674 | 136
30,000 | 605 | 8572 | 17,177 | 9065 | 220
40,000 | 825 | 14,896 | 29,813 | 11,843 | 293
50,000 | 1006 | 23,235 | 46,541 | 16,208 | 387
Dataset | |
---|---|---
MNIST | 0.1550 | 0.0535
CIFAR10 | 0.0331 | 0.0173
notMNIST | 0.2123 | 0.0625
Linnaeus | 0.0250 | 0.0118
STL10 | 0.0695 | 0.0207
SVHN | 0.0004 | 0.0010
ImageNet-1 | 0.0062 | 0.0131
ImageNet-2 | 0.0293 | 0.0145
ImageNet-3 | 0.0191 | 0.0125
ImageNet-4 | 0.0092 | 0.0076
Pearson | 0.9847 |
Spearman | 0.9152 |
Class | In-Class Variability (× 1000)
---|---
Airplane | 0.1557
Automobile | 0.2069
Bird | 0.1394
Cat | 0.1803
Deer | 0.0123
Dog | 0.1830
Frog | 0.1344
Horse | 0.1775
Ship | 0.1472
Truck | 0.1997
Data | | | | | Accuracy (%)
---|---|---|---|---|---
Original CIFAR10 | 0.2213 | 0.7909 | 0.7065 | 0.7030 | 42.84
Degraded CIFAR10 | 0.2698 | 0.7035 | 0.6096 | 0.6049 | 41.28
Cho, H.; Lee, S. Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data. Appl. Sci. 2021, 11, 472. https://doi.org/10.3390/app11020472