Adapting Geo-Indistinguishability for Privacy-Preserving Collection of Medical Microdata
Abstract
1. Introduction
1.1. Motivation
1.2. Contributions
- We develop a privacy-preserving framework for collecting medical microdata from individual users. Specifically, we adapt Geo-I, which was originally designed to protect location privacy, to the collection of medical microdata, which are a form of text data. This is the first attempt to utilize Geo-I for the privacy-preserving collection of medical microdata.
- To address the reduced utility of datasets collected under Geo-I’s perturbation mechanism, we introduce a new data perturbation method for Geo-I that utilizes prior distribution information of the data to be collected. The primary advantage of leveraging a prior distribution when perturbing the original microdata is that the collected perturbed microdata follow a distribution similar to that of the original dataset.
- Furthermore, we evaluate the performance of our proposed algorithms using real-world data. The results demonstrate that our approach significantly outperforms existing methods. In particular, the experimental results confirm that the proposed method maintains the utility of the collected datasets even in scenarios demanding high levels of privacy protection, which typically require considerable perturbation of the original data.
2. Related Work
2.1. Privacy-Preserving Text Analysis and Collection
2.2. Geo-Indistinguishability and Its Applications
3. Background
3.1. Differential Privacy
3.2. Geo-Indistinguishability
4. Privacy-Preserving Framework for Collecting Medical Microdata
- Data-collection server: The data-collection server generates an obfuscation matrix for perturbing actual medical microdata under ε-Geo-I and distributes it to each user. When generating this obfuscation matrix, the server utilizes prior distribution information derived from the available historical data.
- Individual user: On receiving the obfuscation matrix, each user perturbs their own medical microdata according to the probabilities in the obfuscation matrix and then sends the perturbed data to the data-collection server.
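The two roles above can be illustrated with a minimal Python sketch. It assumes the standard Geo-I condition, K[x][z] ≤ exp(ε · d(x, x′)) · K[x′][z] for all inputs x, x′ and outputs z, where K is the obfuscation matrix and d is the distance between the vector representations. The matrix here is built with a simple exponential weighting that is known to satisfy this condition; it is only a stand-in for the server-side optimization and does not use a prior. The coordinates, the value of ε, and all function names are hypothetical.

```python
import math
import random

# Hypothetical 2-D coordinates for five diagnoses (illustrative values only).
DIAGNOSES = ["newborn", "bleed", "coronary", "pneumonia", "sepsis"]
POINTS = [(0.0, 0.0), (1.0, 0.5), (1.2, 0.4), (2.0, 1.5), (2.1, 1.7)]


def exponential_obfuscation_matrix(points, epsilon):
    """Server side (stand-in): row x holds P(report z | true value x),
    with weights proportional to exp(-epsilon * d(x, z) / 2)."""
    matrix = []
    for x in points:
        weights = [math.exp(-epsilon * math.dist(x, z) / 2.0) for z in points]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def satisfies_geo_i(matrix, points, epsilon, tol=1e-12):
    """Check K[x][z] <= exp(epsilon * d(x, x2)) * K[x2][z] for all x, x2, z."""
    n = len(points)
    for x in range(n):
        for x2 in range(n):
            bound = math.exp(epsilon * math.dist(points[x], points[x2]))
            for z in range(n):
                if matrix[x][z] > bound * matrix[x2][z] + tol:
                    return False
    return True


def perturb(true_index, matrix, rng):
    """User side: sample the reported index from the row of the true value."""
    row = matrix[true_index]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


EPSILON = 1.0  # assumed privacy parameter
K = exponential_obfuscation_matrix(POINTS, EPSILON)
rng = random.Random(0)
reported = perturb(DIAGNOSES.index("sepsis"), K, rng)
print(DIAGNOSES[reported])
```

Only the sampled value leaves the user's device; the server never sees the true index. A matrix that always reports the true value (the identity matrix) fails the check, which is why some perturbation is unavoidable under Geo-I.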
4.1. Preliminary
4.2. Data-Collection Server
4.2.1. Vector Space Representation of Medical Microdata
4.2.2. Computation of Obfuscation Matrix
4.2.3. Estimation of Prior Distribution
4.3. User-Side Processing
4.4. Privacy Analysis
4.5. Limitations
5. Experiment
5.1. Experimental Setup
- The Laplace mechanism-based approach, which corresponds to the method proposed in [25] adapted to our problem (LM);
- The approach discussed in Section 4.2.2 that utilizes dimensionality reduction techniques to convert the vector generated with word-embedding techniques into a two-dimensional representation, followed by the application of the optimization mechanism (OM);
- The approach that utilizes the perturbation mechanism proposed in [29], which does not use a prior distribution (NP);
- The proposed method that leverages prior distribution information of the data being collected (PM).
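For intuition about how a noise-based baseline such as LM differs from the matrix-based methods, the following Python sketch adds random noise to a record's vector and then reports the nearest candidate. It assumes noise with density proportional to exp(-ε‖z‖), sampled as a uniform direction scaled by a Gamma-distributed radius, followed by a nearest-neighbour step. These choices follow the general multivariate-perturbation idea and are assumptions, not a transcription of [25] or of the LM configuration used in the experiments. The embeddings and function names are hypothetical.

```python
import math
import random

# Hypothetical low-dimensional embeddings (illustrative values only).
EMBEDDINGS = {
    "newborn": (0.0, 0.0),
    "bleed": (1.0, 0.5),
    "coronary": (1.2, 0.4),
    "pneumonia": (2.0, 1.5),
    "sepsis": (2.1, 1.7),
}


def sample_noise(dim, epsilon, rng):
    """Draw noise whose density is proportional to exp(-epsilon * ||z||)."""
    # Uniform direction on the unit sphere via a normalised Gaussian vector.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in direction))
    # Radius follows a Gamma distribution with shape dim and scale 1/epsilon.
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * c / norm for c in direction]


def perturb_by_noise(true_vector, candidates, epsilon, rng):
    """Add noise to the true vector, then report the closest candidate name."""
    noise = sample_noise(len(true_vector), epsilon, rng)
    noisy = [v + n for v, n in zip(true_vector, noise)]
    return min(candidates, key=lambda name: math.dist(noisy, candidates[name]))


rng = random.Random(7)
print(perturb_by_noise(EMBEDDINGS["sepsis"], EMBEDDINGS, 1.0, rng))
```

Unlike the matrix-based methods, this baseline needs no server-side precomputation, but the reporting probabilities are fixed by the noise shape and cannot be tuned to a prior distribution.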
- MIMIC-III: The first dataset is the MIMIC-III database [46]. This open-source database contains anonymized health data from more than 46,000 patients who were admitted to intensive care units (ICUs) in the United States from 2001 to 2012. We specifically use the admission data from this database, which consist of 58,976 records.
- Wikipedia Disease: For the second dataset, we first collect disease data from the “Lists of diseases” page on Wikipedia. These data are arranged in a tree structure with a maximum depth of 4. The disease names used in the experiments are those at the leaf nodes of this tree, 61 in total. We then randomly generate data for 61,000 patients, each associated with one of the 61 diseases.
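The synthetic part of the second dataset can be sketched in Python as follows. The sketch assumes that each patient's disease is drawn uniformly at random, which the description above does not specify, and it uses placeholder disease names rather than the actual Wikipedia leaf nodes. It also shows one simple way to turn a set of records into a prior distribution, namely relative frequencies; this estimator is an assumption made for illustration.

```python
import random
from collections import Counter


def generate_patients(diseases, n_patients, rng):
    """Assign each synthetic patient one disease, drawn uniformly (assumed)."""
    return [rng.choice(diseases) for _ in range(n_patients)]


def empirical_prior(records, diseases):
    """Estimate a prior as the relative frequency of each disease."""
    counts = Counter(records)
    total = len(records)
    return {d: counts[d] / total for d in diseases}


# Placeholder names standing in for the 61 leaf-node diseases.
DISEASES = [f"disease_{i:02d}" for i in range(61)]

rng = random.Random(42)
patients = generate_patients(DISEASES, 61_000, rng)
prior = empirical_prior(patients, DISEASES)
print(len(patients), len(prior))
```

With uniform sampling, each disease appears roughly 1,000 times; a skewed sampling distribution could be substituted by replacing `rng.choice` with a weighted draw.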
5.2. Results and Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ericsson Mobility Report. Available online: https://www.ericsson.com/en/reports-and-papers/mobility-report (accessed on 4 May 2023).
- Narayanan, A.; Shmatikov, V. How to break anonymity of the Netflix prize dataset. arXiv 2007, arXiv:0610105.
- Dash, S.; Shakyawar, S.K.; Sharma, M.; Kaushik, S. Big data in healthcare: Management, analysis and future prospects. J. Big Data 2019, 6, 54.
- Chen, M.; Hao, Y.; Hwang, K.; Wang, L.; Wang, L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access 2017, 5, 8869–8879.
- Schneble, C.O.; Elger, B.S.; Shaw, D.M. Google’s project Nightingale highlights the necessity of data science ethics review. EMBO Mol. Med. 2020, 12, e12053.
- General Data Protection Regulation. Available online: https://gdpr-info.eu/ (accessed on 18 April 2023).
- Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570.
- LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. Incognito: Efficient full domain k-anonymity. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14–16 June 2005.
- Mascetti, S.; Freni, D.; Bettini, C.; Wang, X.; Jajodia, S. Privacy in geo-social networks: Proximity notification with untrusted service providers and curious buddies. Int. J. Very Large Data Bases 2011, 20, 541–566.
- Popa, R.A.; Blumberg, A.J.; Balakrishnan, H.; Li, F.H. Privacy and accountability for location-based aggregate statistics. In Proceedings of the ACM Conference on Computer and Communications Security, Chicago, IL, USA, 17–21 October 2011.
- Dwork, C. Differential privacy. In Proceedings of the International Conference on Automata, Languages and Programming, Venice, Italy, 10–14 July 2006.
- Andres, M.E.; Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 4–8 November 2013; pp. 901–914.
- Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Optimal geo-indistinguishable mechanisms for location privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 3–7 November 2014; pp. 251–262.
- Ahuja, R.; Ghinita, G.; Shahabi, C. A utility-preserving and scalable technique for protecting location data with geo-indistinguishability. In Proceedings of the International Conference on Extending Database Technology, Lisbon, Portugal, 26–29 March 2019; pp. 210–231.
- Qiu, C.; Squicciarini, A.C. Location privacy protection in vehicle-based spatial crowdsourcing via geo-indistinguishability. In Proceedings of the IEEE International Conference on Distributed Computing Systems, Dallas, TX, USA, 7–10 July 2019; pp. 1061–1071.
- Yao, L.; Mao, C.; Luo, Y. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks. BMC Med. Inform. Decis. Mak. 2019, 19, 31–39.
- Hill, S.; Zhou, Z.; Saul, L.; Shacham, H. On the (In)effectiveness of Mosaicing and Blurring as Tools for Document Redaction. Proc. Priv. Enhancing Technol. 2016, 2016, 403–417.
- Cumby, C.; Ghani, R. A machine learning based system for semi-automatically redacting documents. Proc. AAAI Conf. Artif. Intell. 2011, 25, 1628–1635.
- Anandan, B.; Clifton, C.; Jiang, W.; Murugesan, M.; Camacho, P.P.; Si, L. t-Plausibility: Generalizing words to desensitize text. Trans. Data Priv. 2012, 5, 505–534.
- Sanchez, D.; Batet, M. C-sanitized: A privacy model for document redaction and sanitization. J. Assoc. Inf. Sci. Technol. 2016, 67, 148–163.
- Yue, X.; Du, M.; Wang, T.; Li, Y.; Sun, H.; Chow, S.M. Differential privacy for text analytics via natural text sanitization. arXiv 2021, arXiv:2106.01221.
- Chen, H.; Mo, F.; Chen, C.; Cui, J.; Nie, J.Y. A customised text privatisation mechanism with differential privacy. arXiv 2022, arXiv:2207.01193.
- Carvalho, R.S.; Vasiloudis, T.; Feyisetan, O. BRR: Preserving privacy of text data efficiently on device. arXiv 2021, arXiv:2107.07923.
- Du, M.; Yue, X.; Chow, S.M.; Sun, H. Sanitizing sentence embeddings (and labels) for local differential privacy. In Proceedings of the ACM Web Conference, Austin, TX, USA, 30 April–4 May 2023; pp. 2349–2359.
- Feyisetan, O.; Balle, B.; Drake, T.; Diethe, T. Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 178–186.
- Wang, Z.; Hu, J.; Lv, R.; Wei, J.; Wang, Q. Personalized privacy-preserving task allocation for mobile crowdsensing. IEEE Trans. Mob. Comput. 2018, 18, 1330–1341.
- Yan, K.; Luo, G.; Zheng, X.; Tian, L.; Sai, A.M.V.V. A comprehensive location-privacy-awareness task selection mechanism in mobile crowd-sensing. IEEE Access 2019, 7, 77541–77554.
- Kim, J.W.; Edemacu, K.; Jang, B. Privacy-preserving mechanisms for location privacy in mobile crowdsensing: A survey. J. Netw. Comput. Appl. 2022, 200, 103315.
- Zhang, P.; Cheng, X.; Su, S.; Wang, N. Area coverage-based worker recruitment under geo-indistinguishability. Comput. Netw. 2022, 217, 109340.
- Ma, C.; Chen, C.W. Nearby friend discovery with geo-indistinguishability to stalkers. Procedia Comput. Sci. 2014, 34, 352–359.
- Huang, C.; Lu, R.; Zhu, H.; Shao, J.; Alamer, A.; Lin, X. EPPD: Efficient and privacy-preserving proximity testing with differential privacy techniques. In Proceedings of the IEEE International Conference on Communications, Kuala Lumpur, Malaysia, 22–27 May 2016; pp. 1–6.
- Tong, W.; Hua, J.; Zhong, S. A jointly differentially private scheduling protocol for ridesharing services. IEEE Trans. Inf. Forensics Secur. 2017, 12, 2444–2456.
- Shi, D.; Ding, J.; Errapotu, S.M.; Yue, H.; Xu, W.; Zhou, X.; Pan, M. Deep Q-network-based route scheduling for TNC vehicles with passengers’ location differential privacy. IEEE Internet Things J. 2019, 6, 5.
- Ren, W.; Tang, S. EGeoIndis: An effective and efficient location privacy protection framework in traffic density detection. Veh. Commun. 2020, 21, 100187.
- Chen, R.; Li, L.; Chen, J.J.; Hou, R.; Gong, Y.; Guo, Y.; Pan, M. COVID-19 vulnerability map construction via location privacy preserving mobile crowdsourcing. In Proceedings of the GLOBECOM 2020-2020 IEEE Global Communications Conference, Taipei, Taiwan, 7–11 December 2020.
- Machanavajjhala, A.; Kifer, D.; Abowd, J.; Gehrke, J.; Vilhuber, L. Privacy: Theory meets practice on the map. In Proceedings of the IEEE International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008.
- Xiao, X.; Wang, G.; Gehrke, J. Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 2011, 23, 1200–1214.
- McSherry, F.D. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. Commun. ACM 2010, 53, 19–30.
- Xiao, X.; Bender, G.; Hay, M.; Gehrke, J. iReduct: Differential privacy with reduced relative errors. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011.
- Jang, B.; Kim, I.; Kim, J.W. Word2vec convolutional neural networks for classification of news articles and tweets. PLoS ONE 2019, 14, e0220976.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014.
- Mackiewicz, A.; Ratajczak, W. Principal components analysis (PCA). Comput. Geosci. 1993, 19, 303–342.
- Chatzikokolakis, K.; Elsalamouny, E.; Palamidessi, C. Efficient utility improvement for location privacy. In Proceedings of the Privacy Enhancing Technologies, Minneapolis, MN, USA, 18–21 July 2017; pp. 210–231.
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
- MIMIC-III Clinical Database. Available online: https://physionet.org/content/mimiciii/1.4/ (accessed on 18 April 2023).
| Diagnosis | Method | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 | Month 6 | Month 7 | Month 8 | Month 9 | Month 10 | Month 11 | Month 12 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| newborn | | 517 | 427 | 499 | 489 | 520 | 530 | 490 | 550 | 512 | 483 | 545 | 537 | 508.25 |
| | | 289 | 230 | 230 | 263 | 276 | 264 | 274 | 293 | 272 | 250 | 264 | 269 | 264.50 |
| | | 49 | 51 | 50 | 50 | 62 | 35 | 43 | 41 | 53 | 54 | 46 | 43 | 48.08 |
| bleed | | 184 | 187 | 175 | 190 | 188 | 148 | 206 | 194 | 198 | 205 | 179 | 175 | 185.75 |
| | | 116 | 126 | 113 | 128 | 115 | 107 | 133 | 144 | 135 | 152 | 132 | 119 | 126.67 |
| | | 61 | 65 | 81 | 37 | 57 | 81 | 71 | 64 | 72 | 66 | 59 | 58 | 64.33 |
| coronary | | 268 | 236 | 291 | 256 | 290 | 272 | 250 | 277 | 261 | 299 | 263 | 269 | 269.33 |
| | | 198 | 167 | 209 | 196 | 215 | 204 | 189 | 212 | 199 | 246 | 207 | 186 | 202.33 |
| | | 109 | 110 | 100 | 125 | 108 | 117 | 152 | 156 | 120 | 109 | 105 | 130 | 120.08 |
| pneumonia | | 206 | 208 | 216 | 181 | 177 | 168 | 156 | 149 | 150 | 155 | 177 | 214 | 179.75 |
| | | 162 | 158 | 151 | 133 | 125 | 111 | 98 | 127 | 104 | 117 | 136 | 158 | 131.67 |
| | | 50 | 45 | 19 | 58 | 42 | 56 | 93 | 53 | 62 | 78 | 33 | 15 | 50.33 |
| sepsis | | 149 | 115 | 163 | 144 | 143 | 163 | 153 | 159 | 130 | 141 | 147 | 160 | 147.25 |
| | | 104 | 70 | 114 | 89 | 102 | 105 | 93 | 104 | 85 | 91 | 91 | 107 | 96.25 |
| | | 54 | 85 | 72 | 56 | 63 | 67 | 40 | 49 | 86 | 56 | 50 | 43 | 60.08 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Song, S.; Kim, J. Adapting Geo-Indistinguishability for Privacy-Preserving Collection of Medical Microdata. Electronics 2023, 12, 2793. https://doi.org/10.3390/electronics12132793