Differential Privacy High-Dimensional Data Publishing Based on Feature Selection and Clustering
Abstract
1. Introduction
2. Materials and Methods
2.1. Differential Privacy
2.1.1. Differential Privacy Definition
2.1.2. Differential Privacy Mechanisms
- Laplace mechanism
- Exponential mechanism
2.1.3. Differential Privacy Properties
2.2. Symmetric Uncertainty (SU)
2.3. Random Forest
3. Related Work
4. Algorithm
4.1. Algorithmic Framework
4.2. First Phase
4.3. Second Phase
4.4. Third Phase
5. Experiments
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
No | Features | No | Features |
---|---|---|---|
1 | Radius mean | 16 | Compactness severity |
2 | Texture mean | 17 | Concavity severity |
3 | Perimeter mean | 18 | Concave points severity |
4 | Area mean | 19 | Symmetry severity |
5 | Smoothness mean | 20 | Fractal dimension severity |
6 | Compactness mean | 21 | Radius worst |
7 | Concavity mean | 22 | Texture worst |
8 | Concave points mean | 23 | Perimeter worst |
9 | Symmetry mean | 24 | Area worst |
10 | Fractal dimension mean | 25 | Smoothness worst |
11 | Radius severity | 26 | Compactness worst |
12 | Texture severity | 27 | Concavity worst |
13 | Perimeter severity | 28 | Concave points worst |
14 | Area severity | 29 | Symmetry worst |
15 | Smoothness severity | 30 | Fractal dimension worst |
Actual Class | Predicted Positive | Predicted Negative |
---|---|---|
Positive | TP | FN |
Negative | FP | TN |
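As a minimal sketch of how the cells of a confusion matrix like the one above translate into the usual per-class metrics, the following Python snippet computes precision, recall, and F1 from raw counts. The function name `metrics_from_counts` and the example counts are hypothetical, chosen only for illustration, and are not taken from the article's experiments. Returning 0.0 when a denominator is zero is an assumed convention.

```python
def metrics_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class.

    tp, fp, fn are the true-positive, false-positive, and
    false-negative counts read off a confusion matrix.
    A zero denominator yields 0.0 (assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not from the article.
p, r, f1 = metrics_from_counts(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

True negatives (TN) do not enter any of these three metrics, which is why the function takes only three counts.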
Conditions | Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|---|
RF_1 | B | 0.98 | 0.87 | 0.92 | 357 |
RF_1 | M | 0.82 | 0.97 | 0.89 | 212 |
SVM_1 | B | 0.63 | 1.00 | 0.77 | 357 |
SVM_1 | M | 0.00 | 0.00 | 0.00 | 212 |
RF_5 | B | 0.99 | 0.88 | 0.93 | 357 |
RF_5 | M | 0.83 | 0.98 | 0.90 | 212 |
SVM_5 | B | 0.64 | 1.00 | 0.78 | 357 |
SVM_5 | M | 1.00 | 0.06 | 0.11 | 212 |
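The F1-Score column in the table above can be cross-checked against the Precision and Recall columns, since F1 is their harmonic mean. The sketch below is one way to do that. The names `REPORTED`, `f1_from_pr`, and `check_rows` are hypothetical, and the tolerance of 0.01 is an assumption that allows for the table's inputs being rounded to two decimals.

```python
# (condition, class, precision, recall, reported F1) copied from the table.
REPORTED = [
    ("RF_1", "B", 0.98, 0.87, 0.92),
    ("RF_1", "M", 0.82, 0.97, 0.89),
    ("SVM_1", "B", 0.63, 1.00, 0.77),
    ("SVM_1", "M", 0.00, 0.00, 0.00),
    ("RF_5", "B", 0.99, 0.88, 0.93),
    ("RF_5", "M", 0.83, 0.98, 0.90),
    ("SVM_5", "B", 0.64, 1.00, 0.78),
    ("SVM_5", "M", 1.00, 0.06, 0.11),
]


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def check_rows(rows, tol=0.01):
    """Return (condition, class, recomputed F1, within tolerance) per row."""
    results = []
    for cond, cls, p, r, reported in rows:
        recomputed = f1_from_pr(p, r)
        results.append((cond, cls, recomputed, abs(recomputed - reported) <= tol))
    return results


for cond, cls, f1, ok in check_rows(REPORTED):
    print(f"{cond:6s} {cls}  recomputed F1 = {f1:.3f}  consistent = {ok}")
```

A row flagged as inconsistent would suggest a transcription error in the table rather than a problem with the underlying classifier.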
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chu, Z.; He, J.; Zhang, X.; Zhang, X.; Zhu, N. Differential Privacy High-Dimensional Data Publishing Based on Feature Selection and Clustering. Electronics 2023, 12, 1959. https://doi.org/10.3390/electronics12091959