Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification
Abstract
:1. Introduction
2. Materials and Methods
2.1. Genome-Wide Association Studies
2.2. Algorithms for GWAS That Account for Population Stratification
2.3. Privacy-Preserving Computation and GWAS
2.3.1. Deployment Models for Privacy-Preserving GWAS
2.3.2. Cryptographic Secure Multi-Party Computation
2.3.3. Trusted Execution Environments
2.3.4. Adapting Algorithms for Secure Computing
- Cryptographic security. During computation, no computing party learns any private input or intermediate value computed by the function unless the value is explicitly published. Private values are also not leaked through changes in the running time of the algorithm. We achieve this by designing algorithms that process all private values either by using MPC and TEEs or by making the algorithm running time as independent of the inputs as possible. This means designing the algorithms as straight-line programs that have no branching based on private values or where each branching decision is analysed with regard to what it could leak in the running time.
- Source privacy. An algorithm is source-private if all its outputs and all intermediate values do not depend on the order of inputs. The GWAS algorithms in this paper are designed to avoid leaking information based on the order of donors’ data in the inputs. The order of SNPs in the inputs will affect the SNPs in the output, but this is by design as the related metadata are not private data.
- Output privacy. An algorithm is output-private if the results it declassifies do not leak the private inputs. The GWAS algorithms in this paper can be used either as parts of larger processing workflows, and in this case they do not declassify outputs. However, the results of the algorithms are statistical significances per SNP and contain no personal data given a sufficient amount of inputs.
3. Results
3.1. Privacy-Preserving EIGENSTRAT and FastPCA
Algorithm 1: Privacy-preserving EIGENSTRAT |
Algorithm 2: Privacy-preserving FastPCA randomised algorithm for top k left singular vectors |
3.2. Privacy-Preserving EMMAX
Algorithm 3: Privacy-preserving EMMAX |
Input: Genotype matrix (private), mask matrix (private), phenotype vector (private), number of GS-PCA iterations J (public) |
Output: Vector of statistics for the t-test ; ; ; ; ; ; ; ; // is a matrix, each row is equal to ; // is a matrix, each row is equal to ; ; // is a matrix, each column is equal to ; ; ; // Adds a different positive constant to each maximum value in to make sure that there is a single largest value in the vector, prevents leakage when using to find ; ; //Newton–Rhapson algorithm; ; ; // is a square matrix with at the diagonal; ; ; ; ; ; ; ; ; ; ; ; // is an matrix, each row is equal to ; ; ; // is an matrix, each row is equal to ; ; ; ; ; ; |
3.3. Genomic Control
Algorithm 4: Privacy-preserving genomic control |
3.4. Performance Measurements
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hartl, D.L.; Clark, A.G. Principles of Population Genetics, 4th ed.; Sinauer Associates: Sunderland, MA, USA, 2006. [Google Scholar]
- Hellwege, J.N.; Keaton, J.M.; Giri, A.; Gao, X.; Velez Edwards, D.R.; Edwards, T.L. Population Stratification in Genetic Association Studies. Curr. Protoc. Hum. Genet. 2017, 95, 1.22.1–1.22.23. [Google Scholar] [CrossRef] [PubMed]
- Campbell, C.D.; Ogburn, E.L.; Lunetta, K.L.; Lyon, H.N.; Freedman, M.L.; Groop, L.C.; Altshuler, D.; Ardlie, K.G.; Hirschhorn, J.N. Demonstrating stratification in a European American population. Nat. Genet. 2005, 37, 868. [Google Scholar] [CrossRef] [PubMed]
- European Data Protection Board. Recommendations 01/2020 on Measures that Supplement Transfer Tools to Ensure Compliance with the EU Level of Protection of Personal Data. 2020. Available online: https://edpb.europa.eu/our-work-tools/public-consultations-art-704/2020/recommendations-012020-measures-supplement-transfer_en (accessed on 19 August 2021).
- European Data Protection Supervisor. Preliminary Opinion 8/2020 on the European Health Data Space. 2020. Available online: https://edps.europa.eu/data-protection/our-work/publications/opinions/preliminary-opinion-82020-european-health-data-space_en (accessed on 19 August 2021).
- Kamm, L.; Bogdanov, D.; Laur, S.; Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 2013, 29, 886–893. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Constable, S.D.; Tang, Y.; Wang, S.; Jiang, X.; Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med. Inform. Decis. Mak. 2015, 15, S2. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Bogdanov, D.; Kamm, L.; Laur, S.; Sokk, V. Implementation and Evaluation of an Algorithm for Cryptographically Private Principal Component Analysis on Genomic Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 15, 1427–1432. [Google Scholar] [CrossRef]
- Bonte, C.; Makri, E.; Ardeshirdavani, A.; Simm, J.; Moreau, Y.; Vercauteren, F. Towards practical privacy-preserving genome-wide association study. BMC Bioinform. 2018, 19, 537. [Google Scholar] [CrossRef] [Green Version]
- Cho, H.; Wu, D.J.; Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 2018, 36, 547–551. [Google Scholar] [CrossRef]
- Tkachenko, O.; Weinert, C.; Schneider, T.; Hamacher, K. Large-Scale Privacy-Preserving Statistical Computations for Distributed Genome-Wide Association Studies. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security (ASIACCS’18), Incheon, Korea, 4–8 June 2018; ACM: New York, NY, USA, 2018; pp. 221–235. [Google Scholar]
- Bellafqira, R.; Ludwig, T.E.; Niyitegeka, D.; Génin, E.; Coatrieux, G. Privacy-Preserving Genome-Wide Association Study for Rare Mutations—A Secure FrameWork for Externalized Statistical Analysis. IEEE Access 2020, 8, 112515–112529. [Google Scholar] [CrossRef]
- Poddar, R.; Kalra, S.; Yanai, A.; Deng, R.; Popa, R.A.; Hellerstein, J.M. Senate: A Maliciously-Secure MPC Platform for Collaborative Analytics. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online, 11–13 August 2021; USENIX Association: Vancouver, BC, Canada, 2021. [Google Scholar]
- Zhang, Y.; Dai, W.; Jiang, X.; Xiong, H.; Wang, S. FORESEE: Fully Outsourced secuRe gEnome Study basEd on homomorphic Encryption. BMC Med. Inform. Decis. Mak. 2015, 15, S5. [Google Scholar] [CrossRef] [Green Version]
- Wang, S.; Zhang, Y.; Dai, W.; Lauter, K.E.; Kim, M.; Tang, Y.; Xiong, H.; Jiang, X. HEALER: Homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics 2016, 32, 211–218. [Google Scholar] [CrossRef] [Green Version]
- Chen, F.; Yang, H.; Kim, J.; Ohno-Machado, L.; Ding, S.; Jiang, X.; Wang, S.; Fox, D.; Lauter, K.; Lu, Y.; et al. PRINCESS: Privacy-protecting Rare disease International Network Collaboration via Encryption through Software guard extensionS. Bioinformatics 2016, 33, 871–878. [Google Scholar] [CrossRef] [Green Version]
- Asvadishirehjini, A.; Kantarcioglu, M.; Malin, B. A Framework for Privacy-Preserving Genomic Data Analysis Using Trusted Execution Environments. In Proceedings of the 2020 Second IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), Atlanta, GA, USA, 28–31 October 2020; pp. 138–147. [Google Scholar]
- Kockan, C.; Zhu, K.; Dokmai, N.; Karpov, N.; Kulekci, M.O.; Woodruff, D.P.; Sahinalp, S.C. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat. Methods 2020, 17, 295–301. [Google Scholar] [CrossRef]
- Pascoal, T.; Decouchant, J.; Boutet, A.; Veríssimo, P. DyPS: Dynamic, Private and Secure GWAS. In Proceedings of the Privacy Enhancing Technologies (PoPETS), Online, 12–16 July 2021; Volume 2, pp. 214–234. [Google Scholar]
- Price, A.L.; Patterson, N.J.; Plenge, R.M.; Weinblatt, M.E.; Shadick, N.A.; Reich, D. Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nat. Genet. 2006, 38, 904–909. [Google Scholar] [CrossRef] [PubMed]
- Simmons, S.; Sahinalp, C.; Berger, B. Enabling Privacy-Preserving GWASs in Heterogeneous Human Populations. Cell Syst. 2016, 3, 54–61. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mittos, A.; Malin, B.; Cristofaro, E.D. Systematizing Genomic Privacy Research—A Critical Analysis. arXiv 2017, arXiv:abs/1712.02193. [Google Scholar]
- Galinsky, K.J.; Bhatia, G.; Loh, P.R.; Georgiev, S.; Mukherjee, S.; Patterson, N.J.; Price, A.L. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 2016, 98, 456–472. [Google Scholar] [CrossRef] [Green Version]
- Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.Y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010, 42, 348–354. [Google Scholar] [CrossRef] [Green Version]
- Devlin, B.; Roeder, K. Genomic Control for Association Studies. Biometrics 1999, 55, 997–1004. [Google Scholar] [CrossRef] [PubMed]
- Bogdanov, D. Sharemind: Programmable Secure Computations with Practical Applications. Ph.D. Thesis, University of Tartu, Tartu, Estonia, 2013. [Google Scholar]
- Cramer, R.; Damgård, I.; Nielsen, J. Secure Multiparty Computation and Secret Sharing; Cambridge University Press: New York, NY, USA, 2015. [Google Scholar]
- Archer, D.W.; Bogdanov, D.; Lindell, Y.; Kamm, L.; Nielsen, K.; Pagter, J.I.; Smart, N.P.; Wright, R.N. From Keys to Databases—Real-World Applications of Secure Multi-Party Computation. Comput. J. 2018, 61, 1749–1771. [Google Scholar] [CrossRef] [Green Version]
- Hastings, M.; Hemenway, B.; Noble, D.; Zdancewic, S. SoK: General Purpose Compilers for Secure Multi-Party Computation. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 1220–1237. [Google Scholar]
- Randmets, J. An Overview of Vulnerabilities and Mitigations of Intel SGX Applications; Technical Report D-2-116; Cybernetica AS: Tallinn, Estonia, 2021. [Google Scholar]
- Bogdanov, D.; Kamm, L.; Laur, S.; Sokk, V. Rmind: A tool for cryptographically secure statistical analysis. IEEE Trans. Depend. Secur. Comput. 2016, 15, 481–495. [Google Scholar] [CrossRef] [Green Version]
- Bogdanov, D.; Laur, S.; Talviste, R. A Practical Analysis of Oblivious Sorting Algorithms for Secure Multi-party Computation. In Proceedings of the 19th Nordic Conference on Secure IT Systems (NordSec 2014), Tromsø, Norway, 15–17 October 2014; LNCS; Springer: Cham, Switzerland, 2014; Volume 8788, pp. 59–74. [Google Scholar]
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; John Hopkins University Press: Baltimore, MD, USA, 2013. [Google Scholar]
- Laud, P.; Randmets, J. A Domain-Specific Language for Low-Level Secure Multiparty Computation Protocols. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (ACM 2015), Denver, CO, USA, 12–16 October 2015; pp. 1492–1503. [Google Scholar]
- Randmets, J. Programming Languages for Secure Multi-Party Computation Application Development. Ph.D. Thesis, University of Tartu, Tartu, Estonia, 2017. [Google Scholar]
- Bogdanov, D.; Niitsoo, M.; Toft, T.; Willemson, J. High-performance secure multi-party computation for data mining applications. Int. J. Inf. Secur. 2012, 11, 403–418. [Google Scholar] [CrossRef]
Donor ID | … | Phenotype Vector y | |||
---|---|---|---|---|---|
0 | 1 | … | 0 | 0 | |
1 | 0 | … | 0 | 1 | |
… | … | … | … | … | … |
2 | 0 | … | 2 | 1 |
Group | Total | |||
---|---|---|---|---|
Case | R | |||
Control | S | |||
Total | N |
Subtask | 1500 SNPs | 2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 2000 SNPs |
---|---|---|---|---|---|---|
217 Donors | 217 Donors | 217 Donors | 217 Donors | 434 Donors | 868 Donors | |
Table preparation | 67.4 s | 85.4 s | 218.4 s | 865.1 s | 161.6 s | 347.0 s |
GS-PCA | 182.8 s | 187.9 s | 183.3 s | 704.8 s | 685.8 s | 2606.0 s |
Stratification control | 1365.9 s | 1849.5 s | 4520.2 s | 18,684.3 s | 6214.4 s | 22,112.0 s |
Test statistics | 136.2 s | 174.5 s | 436.6 s | 1681.7 s | 344.2 s | 675.4 s |
Total | 1752.4 s | 2297.2 s | 5358.4 s | 21,413.2 s | 7406.0 s | 25,740.4 s |
(29.2 min) | (38.3 min) | (89.3 min) | (5.95 h) | (2.06 h) | (7.15 h) |
Subtask | 1500 SNPs | 2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 2000 SNPs |
---|---|---|---|---|---|---|
217 Donors | 217 Donors | 217 Donors | 217 Donors | 434 Donors | 868 Donors | |
Table preparation | 75.7 s | 102.1 s | 247.9 s | 1000.8 s | 198.1 s | 393.9 s |
PCA using random | 2482.0 s | 3290.9 s | 7920.7 s | 31,355.6 s | 6194.8 s | 11,900.2 s |
matrix theory | ||||||
Stratification control | 216.3 s | 293.7 s | 707.3 s | 2800.8 s | 564.9 s | 1139.3 s |
Test statistics | 159.1 s | 204.9 s | 505.6 s | 1998.5 s | 407.9 s | 815.6 s |
Total | 2933.0 s | 3891.5 s | 9381.5 s | 37,155.7 s | 7365.7 s | 14,249.0 s |
(48.9 min) | (64.9 min) | (2.61 h) | (10.32 h) | (2.04 h) | (3.96 h) |
Subtask | 1000 SNPs | 5000 SNPs | 20,000 SNPs | 1000 SNPs | 1000 SNPs |
---|---|---|---|---|---|
217 Donors | 217 Donors | 217 Donors | 100 Donors | 434 Donors | |
Table preparation | 44.7 s | 239.3 s | 891.0 s | 24.7 s | 89.8 s |
Kinship matrix | 700.0 s | 3420.2 s | 13,988.5 s | 173.3 s | 2844.0 s |
GS-PCA | 14,241.6 s | 14,209.2 s | 14,239.3 s | 1912.8 s | 104,067.3 s |
Maximum likelihood | 4934.3 s | 21,381.8 s | 81,961.9 s | 1147.5 s | 23,185.7 s |
Test statistics | 0.4 s | 1.7 s | 7.6 s | 0.4 s | 0.4 s |
Total | 19,920.9 s | 39,252.5 s | 111,088.3 s | 3258.7 s | 130,187.1 s |
(5.53 h) | (10.90 h) | (30.86 h) | (54.3 min) | (36.2 h) |
Subtask | 5000 SNPs | 300,000 SNPs | 300,000 SNPs | 300,000 SNPs |
---|---|---|---|---|
217 Donors | 217 Donors | 434 Donors | 661 Donors | |
Table preparation | 33.8 s | 1912.3 s | 3619.5 s | 5451.1 s |
Cochran–Armitage test | 0.3 s | 24.1 s | 23.5 s | 23.8 s |
Stratification control | 0.9 s | 83.3 s | 83.4 s | 83.6 s |
Total | 34.9 s | 2019.7 s | 3726.4 s | 5558.5 s |
(0.6 min) | (33.7 min) | (62.1 min) | (92.6 min) |
Subtask | EIGENSTRAT | EMMAX | ||||
---|---|---|---|---|---|---|
2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 1000 SNPs | 5000 SNPs | |
217 Donors | 217 Donors | 217 Donors | 868 Donors | 217 Donors | 217 Donors | |
Table preparation | 101 ms | 253 ms | 1600 ms | 272 ms | 28 ms | 215 ms |
Kinship matrix | 219 ms | 606 ms | 2642 ms | 3552 ms | 111 ms | 609 ms |
GS-PCA | 46 ms | 53 ms | 69 ms | 735 ms | 4202 ms | 4220 ms |
Stratification control/ | 443 ms | 1337 ms | 15675 ms | 2782 ms | ||
Maximum likelihood | 26 ms | 25 ms | ||||
Test statistics | 74 ms | 175 ms | 2489 ms | 442 ms | 409 ms | 2076 ms |
Total | 883 ms | 2424 ms | 22475 ms | 7783 ms | 4776 ms | 7145 ms |
Experiment | Data Size | MPC | HI | If HI Was 100 Times Slower |
---|---|---|---|---|
EIGENSTRAT | 5000 SNPs | 5358.4 s | 2.42 s | 242 s |
200 donors | (89.3 min) | |||
EMMAX | 5000 SNPs | 87,400 s | 7.14 s | 714 s |
200 donors | (24.3 h) | |||
Genomic control | 300,000 SNPs | 2096.5 s | 15.11 s | ∼20 s |
200 donors | (35 min) | (reading input data) | (reading input data) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ostrak, A.; Randmets, J.; Sokk, V.; Laur, S.; Kamm, L. Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. Cryptography 2021, 5, 21. https://doi.org/10.3390/cryptography5030021
Ostrak A, Randmets J, Sokk V, Laur S, Kamm L. Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. Cryptography. 2021; 5(3):21. https://doi.org/10.3390/cryptography5030021
Chicago/Turabian StyleOstrak, Andre, Jaak Randmets, Ville Sokk, Sven Laur, and Liina Kamm. 2021. "Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification" Cryptography 5, no. 3: 21. https://doi.org/10.3390/cryptography5030021
APA StyleOstrak, A., Randmets, J., Sokk, V., Laur, S., & Kamm, L. (2021). Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. Cryptography, 5(3), 21. https://doi.org/10.3390/cryptography5030021