Detecting Trivariate Associations in High-Dimensional Datasets
Abstract
1. Introduction
2. Related Works
3. Proposed Method
3.1. Trivariate Mutual Information
3.2. Trivariate Characteristic Matrix
- First, when the variables are independent, the quadratic optimization does not improve on the adaptive partition, and the single optimization already gives a correlation value close to 0.
- Second, when the variables are strongly related and free of noise, the quadratic optimization gives correlation values close to 1 for all relationship types, higher than those of the single optimization.
- Third, when the variables are dependent but noisy, the correlation value approximated by the quadratic optimization is, at the same noise level, higher than that of the single optimization across the different relationship types. The quadratic optimization therefore captures more of the relationship than the single optimization alone and improves equitability.
3.3. Generating the Trivariate Characteristic Matrix
Algorithm 1: Generating the trivariate characteristic matrix.
Input: D; parameter c, which controls the granularity of the partition
Output: M(D)
for x = 2 to do
    Get equipartition R with size x
    for y = 2 to do
        Get equipartition Q with size y
        z = [I(x, y, 2), I(x, y, 3), …, I(x, y, z)] = Dynamic optimizing(D, Q, R, z)
        for k = 2 to z do
            M(D)_{x,y,k} = I(x, y, k) / log min{x, y, k}
        end for
        P_l = argmax{M(D)_{x,y,l} | P_l, 2 ≤ l ≤ z}
        I′(x, y, l) = QuadraticApproxMI(D, R, P_l, cl)
        M(D)_{x,y,l} = max{I′(x, y, l), I(x, y, l)} / log min{x, y, l}
    end for
end for
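The normalization step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `characteristic_entries` and the dictionary layout are hypothetical, the approximate mutual-information values are assumed to have been computed already (the dynamic and quadratic optimization steps are not reproduced here), and the natural logarithm is assumed to match the units of those values.

```python
import math


def characteristic_entries(single, quadratic=None):
    """Normalize approximate mutual-information values per grid size.

    single:    dict mapping a grid size (x, y, k) to I(x, y, k), the value
               from the single (dynamic) optimization.
    quadratic: optional dict mapping some grid sizes to I'(x, y, k), the
               value from the quadratic optimization (hypothetical layout).
    Returns a dict mapping (x, y, k) to the normalized entry
    max{I', I} / log min{x, y, k}; where no I' is given, I alone is used.
    """
    quadratic = quadratic or {}
    entries = {}
    for (x, y, k), i_val in single.items():
        # Keep the larger of the two approximations, as in Algorithm 1.
        best = max(i_val, quadratic.get((x, y, k), i_val))
        # Normalize by the log of the smallest number of bins.
        entries[(x, y, k)] = best / math.log(min(x, y, k))
    return entries


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    single = {(2, 2, 2): math.log(2), (3, 3, 2): 0.2}
    quadratic = {(3, 3, 2): 0.5}
    for grid, value in characteristic_entries(single, quadratic).items():
        print(grid, value)
```

Passing the quadratic values as a separate, sparse dictionary mirrors the fact that Algorithm 1 applies the quadratic optimization only at the selected grid, while every other entry keeps its single-optimization value.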
3.4. Time Complexity
3.5. Mathematical Analysis
4. Comparisons with Other Methods
4.1. Performance on Noiseless Relationships
4.2. Performance on Noisy Relationships
5. Exploring GHO Dataset for Associations
5.1. Trivariate Associations with QOTIC
5.2. Bivariate Associations with MIC
5.3. Comparing QOTIC with MIC on GHO Dataset
- The first trivariate association is infant deaths, under-five deaths, and life expectancy, with a correlation value of 0.942. Among the three corresponding bivariate associations, the correlation values of infant deaths and of under-five deaths with life expectancy are 0.31 and 0.342, respectively, while the correlation value of infant deaths and under-five deaths is 0.958.
- The second trivariate association is thinness 1–19 years, thinness five to nine years, and life expectancy, with a correlation value of 0.857. Among the three corresponding bivariate associations, the correlation values of thinness 1–19 years and of thinness five to nine years with life expectancy are 0.386 and 0.384, respectively, while the correlation value of thinness 1–19 years and thinness five to nine years is 0.912.
- The third trivariate association is polio, diphtheria, and life expectancy, with a correlation value of 0.777. Among the three corresponding bivariate associations, the correlation values of polio and of diphtheria with life expectancy are 0.298 and 0.295, respectively, while the correlation value of polio and diphtheria is 0.801.
- The fourth trivariate association is percentage expenditure, GDP, and life expectancy, with a correlation value of 0.685. Among the three corresponding bivariate associations, the correlation values of percentage expenditure and of GDP with life expectancy are 0.31 and 0.377, respectively, while the correlation value of percentage expenditure and GDP is 0.714.
- The fifth trivariate association is income composition of resources, schooling, and life expectancy, with a correlation value of 0.67. Among the three corresponding bivariate associations, the correlation values of income composition of resources and of schooling with life expectancy are 0.614 and 0.497, respectively, while the correlation value of income composition of resources and schooling is 0.72.
- The sixth trivariate association is adult mortality, percentage expenditure, and life expectancy, with a correlation value of 0.642. Among the three corresponding bivariate associations, the correlation values of adult mortality and of percentage expenditure with life expectancy are 0.709 and 0.31, respectively, while the correlation value of adult mortality and percentage expenditure is 0.22.
- The seventh trivariate association is adult mortality, GDP, and life expectancy, with a correlation value of 0.64. Among the three corresponding bivariate associations, the correlation values of adult mortality and of GDP with life expectancy are 0.709 and 0.377, respectively, while the correlation value of adult mortality and GDP is 0.284.
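The comparison in the list above can be tabulated with a few lines of Python. This is only an arithmetic sketch over the values reported in this section: the names `GROUPS` and `gain_over_bivariate` are hypothetical, and the "gain" is simply the trivariate value minus the larger of the two bivariate values that involve life expectancy, a summary chosen here for illustration rather than one defined in the paper.

```python
# Each row: (variable A, variable B,
#            trivariate value for (A, B, life expectancy),
#            bivariate value for (A, life expectancy),
#            bivariate value for (B, life expectancy),
#            bivariate value for (A, B))
GROUPS = [
    ("infant deaths", "under-five deaths", 0.942, 0.31, 0.342, 0.958),
    ("thinness 1–19 years", "thinness five to nine years", 0.857, 0.386, 0.384, 0.912),
    ("polio", "diphtheria", 0.777, 0.298, 0.295, 0.801),
    ("percentage expenditure", "GDP", 0.685, 0.31, 0.377, 0.714),
    ("income composition of resources", "schooling", 0.67, 0.614, 0.497, 0.72),
    ("adult mortality", "percentage expenditure", 0.642, 0.709, 0.31, 0.22),
    ("adult mortality", "GDP", 0.64, 0.709, 0.377, 0.284),
]


def gain_over_bivariate(rows):
    """Return (A, B, gain) per group, where gain is the trivariate value
    minus the stronger bivariate value involving life expectancy."""
    result = []
    for a, b, tri, a_le, b_le, _a_b in rows:
        result.append((a, b, round(tri - max(a_le, b_le), 3)))
    return result


if __name__ == "__main__":
    for a, b, gain in gain_over_bivariate(GROUPS):
        print(f"{a} + {b}: {gain:+.3f}")
```

With these numbers the gain is positive for groups 1 to 5 and negative for groups 6 and 7, where the bivariate value of adult mortality and life expectancy (0.709) exceeds the trivariate value.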
6. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Function | X | Y | Z |
---|---|---|---|
1 | Linear | Linear | Linear |
2 | Linear | Linear × Cosine | Linear × Sine |
3 | Linear | Polynomial | Sine + Linear |
4 | Linear | Piecewise Linear | Linear |
5 | Linear | Cosine | Parabola |
6 | Linear | Exponential | Linear |
7 | Linear | Sine | Logarithm |
8 | Linear | Exponential + Parabola | Cosine + Linear |
9 | Linear | Sine | Cosine |
10 | Polynomial | Cosine | Sine + Linear |
11 | Linear | Polynomial | Polynomial |
12 | Linear | Power | Linear |

Method | Bias (Min) | Bias (Mean) | Bias (Max) | Variance (Min) | Variance (Mean) | Variance (Max) |
---|---|---|---|---|---|---|
QOTIC | 0.011 | 0.096 | 0.143 | 0.0 | 0.001 | 0.003 |
MTDIC | 0.017 | 0.11 | 0.164 | 0.0 | 0.002 | 0.005 |
TEIC | 0.084 | 0.391 | 0.61 | 0.001 | 0.034 | 0.139 |

No. | Function | MTDIC (n = 500) | QOTIC (n = 500) | ESTIC (n = 500) | MTDIC (n = 1000) | QOTIC (n = 1000) | ESTIC (n = 1000) |
---|---|---|---|---|---|---|---|
1 | Linear | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
2 | Exponential | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | Logarithmic | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
4 | Quadratic | 0.94 | 0.96 | 0.96 | 1.00 | 1.00 | 1.00 |
5 | Cubic | 0.96 | 0.97 | 0.97 | 0.97 | 0.98 | 0.98 |
6 | Sinusoidal low freq. | 0.87 | 0.92 | 0.92 | 0.94 | 0.95 | 0.95 |
7 | Sinusoidal high freq. | 0.58 | 0.59 | 0.59 | 0.74 | 0.75 | 0.75 |
8 | Circle | 0.49 | 0.50 | 0.50 | 0.55 | 0.56 | 0.56 |
9 | Step function | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
10 | Two lines | 0.95 | 0.95 | 0.95 | 0.96 | 0.96 | 0.96 |
11 | X line | 0.47 | 0.51 | 0.51 | 0.53 | 0.55 | 0.55 |
12 | X curve | 0.48 | 0.54 | 0.54 | 0.53 | 0.56 | 0.56 |

No. | Function | MTDIC | QOTIC | ESTIC |
---|---|---|---|---|
1 | Linear | 0.06 | 5.13 | 7.13 |
2 | Exponential | 0.07 | 5.13 | 7.13 |
3 | Logarithmic | 0.09 | 5.13 | 7.13 |
4 | Quadratic | 0.36 | 5.13 | 7.13 |
5 | Cubic | 0.19 | 5.13 | 7.13 |
6 | Sinusoidal low freq. | 0.41 | 5.13 | 7.13 |
7 | Sinusoidal high freq. | 0.40 | 5.13 | 7.13 |
8 | Circle | 0.34 | 5.13 | 7.13 |
9 | Step function | 0.08 | 5.13 | 7.13 |
10 | Two lines | 0.10 | 5.13 | 7.13 |
11 | X line | 0.39 | 5.13 | 7.13 |
12 | X curve | 0.37 | 5.13 | 7.13 |

Group | Trivariate association | QOTIC | Bivariate association | MIC |
---|---|---|---|---|
1 | infant deaths, under-five deaths, life expectancy | 0.942 | infant deaths, life expectancy | 0.31 |
1 | | | under-five deaths, life expectancy | 0.342 |
1 | | | infant deaths, under-five deaths | 0.958 |
2 | thinness 1–19 years, thinness five to nine years, life expectancy | 0.857 | thinness 1–19 years, life expectancy | 0.386 |
2 | | | thinness five to nine years, life expectancy | 0.384 |
2 | | | thinness 1–19 years, thinness five to nine years | 0.912 |
3 | polio, diphtheria, life expectancy | 0.777 | polio, life expectancy | 0.298 |
3 | | | diphtheria, life expectancy | 0.295 |
3 | | | polio, diphtheria | 0.801 |
4 | percentage expenditure, GDP, life expectancy | 0.685 | percentage expenditure, life expectancy | 0.31 |
4 | | | GDP, life expectancy | 0.377 |
4 | | | percentage expenditure, GDP | 0.714 |
5 | income composition of resources, schooling, life expectancy | 0.67 | income composition of resources, life expectancy | 0.614 |
5 | | | schooling, life expectancy | 0.497 |
5 | | | income composition of resources, schooling | 0.72 |
6 | adult mortality, percentage expenditure, life expectancy | 0.642 | adult mortality, life expectancy | 0.709 |
6 | | | percentage expenditure, life expectancy | 0.31 |
6 | | | adult mortality, percentage expenditure | 0.22 |
7 | adult mortality, GDP, life expectancy | 0.64 | adult mortality, life expectancy | 0.709 |
7 | | | GDP, life expectancy | 0.377 |
7 | | | adult mortality, GDP | 0.284 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, C.; Wang, S.; Yuan, H.; Dang, Y.; Liu, X. Detecting Trivariate Associations in High-Dimensional Datasets. Sensors 2022, 22, 2806. https://doi.org/10.3390/s22072806