A Secure and Efficient Dynamic Analysis Scheme for Genome Data within SGX-Assisted Servers
Abstract
:1. Introduction
- SEDASGX provides a multi-party genome data analysis architecture for edge computing scenarios based on Intel SGX. This architecture uses the AES-GCM algorithm and the attributes of SGX to ensure the confidentiality and integrity of the code, genetic data, and analysis results. In addition, even if the terminal device does not require hardware support, it still meets the needs of users for uploading and analysis, reducing the hardware requirements of end users.
- SEDASGX is used to construct an oblivious data storage structure based on Path ORAM to avoid the leakage of access patterns of genome data in the analysis and update process, and encrypts them with a tamper-proof encryption algorithm (ASE-GCM) to guarantee the confidentiality and integrity of gene data, as well as the correctness of the analysis results.
- SEDASGX utilizes various genome analysis methods and dynamic updates. Additionally, the identity encryption technology based on the SGX security analysis architecture ensures that during the analysis process users cannot obtain each other’s analysis results.
2. Related Work
- Homomorphic encryption-based schemes. Kim et al. [6] used homomorphic encryption technology to encrypt a DNA sequence and conduct a secure evaluation of distribution over the encrypted data. Sarkar et al. [7] proposed a privacy-preserving genotype imputation using machine learning and a Paillier homomorphic encryption. Wang et al. [8] designed a homomorphic exact logistic regression model algorithm aiming at reducing the computational and storage costs. Blatt et al. [9] presented a privacy-preserving framework based on several advances in homomorphic encryption and demonstrated that it can perform an accurate GWAS analysis for a real dataset of more than 25,000 individuals, keeping all individual’s data encrypted and requiring no user interactions.
- Secure multi-party computation (SMC)-based schemes. Kamm et al. [10] proposed secretly sharing the sensitive data among several parties and computing GWAS over the distributed data. Dong et al. [11] proposed a secure and efficient GWAS scheme. Zhu et al. [12] proposed a privacy-preserving framework for conducting genome-wide association studies over outsourced patient data.
- Hardware-based schemes. Chen et al. [13] presented one of the first implementations of a Software Guard Extension (SGX)-based securely outsourced genetic testing framework, which leveraged multiple cryptographic protocols and a minimal perfect hash scheme to enable efficient and secure data storage and computation outsourcing. Mandal et al. [14] built a memory oblivious structure to search genome variants using Intel SGX. Kockan et al. [15] proposed SkSES, which employs sketch algorithm, data compression, and population stratification reduction methods to reduce the memory consumption.
3. Preliminaries
3.1. Intel SGX
- Memory Isolation. When a program runs on the SGX-enabled platform, it is divided into two parts: an untrusted storage region and a trusted isolated region (enclave). An enclave is a separate block of physical RAM that cannot be accessed by other applications, privileged software, OS, hypervisor, or the firmware on the system in Figure 1(1). Meanwhile, the enclave is used to protect sensitive data and codes. When an SGX program is hung or closed, the untrusted storage region mainly is utilized to store the encrypted sensitive data separated from the enclave.
- Remote Attestation. SGX provides a cryptographic verification that an enclave is running securely on a remote server platform. When an enclave is created, an app enclave generates a set of claims (i.e., Key and REPORT), and an SGX component QE (quoting Enclave) generates an attestation signature of the report by using the EGETKEY instruction in Figure 1((2)-6). QE returns the signature to the App in Figure 1((2)-7). After that, the App sends the signature and key to the verifier in Figure 1((2)-8). Finally, the verifier checks the signature using the Intel attestation services (IASs). In particular, QE only accepts measurements from trusted hardware, and the hardware guarantees that only enclaves that have been correctly created can be measured. Furthermore, a secure channel between the enclave and the client can be established via ECDH [21] and ECDSA [22]. This trusted channel is used to share the secret between the enclave and the client.
3.2. Oblivious RAM
3.3. Genome Analysis Methods
- Chisquare statistics are mainly used to study the relation of gene mutations in genome data. Because human beings are diploid and have two copies of all (non-sex)chromosomes, each person will have either the genotypes aa, ab(ba), or bb for each locus. For example, we sample the genomes of N individuals for particular single nucleotide variants (SNV), of which some have a particular disease (cases), and the rest do not (controls). The genotype aa, ab(ba), and bb represent 0, 1, and 2, respectively. Thus, person genomes can be represented by a vector g∈. The i-th entry corresponds to the number of transcripts the person i has of a allele at locus j. Let y∈ be a vector that represents the gene mutation state of a person if the i-th person has the disease, and if he/she does not. More details are described in [24]. As we will see, these values in Table 1 are sufficient to compute the statistic using the following Equation (1).
Genotype a b Sum Cases ( = 1) Controls ( = 0) Sum 2N - Fisher’s exact test [25] is a statistical test used to determine whether there is a nonrandom association between two categorical variables. It is generally used to determine whether a gene locus is statistically associated with a factor, which is more accurate than the Chisquare test. As preparation for extending to R × C contingency tables, the cell counts in 2 × 2 tables are denoted by for , 1 and , 1. Given the above Table 1, a more general formula is as follows (2). In simple terms, , 1 and , 1 are extended to , 1,…, R and , 1,…, C, followed by the row margins {, ,…, } and the column margins {, ,…, }. Meanwhile, the formula is extended to (3).
- Logistic regression [26] methods are commonly used in statistical analysis. They are also applied to genetic association studies due to the detection demand of massive genetic marker predictor variables, e.g., case/control status. Given a dichotomous phenotype vector of m observations, and a matrix of single nucleotide polymorphism (SNP) genotypes , let p = P. The likelihood function is:
3.4. Identity-Based Encryption (IBE)
- Setup()→(). This algorithm takes as input a security parameter . It outputs the public parameter and a private master key .
- KeyGen()→. This algorithm takes as input the public key , the private master key , and an identity . It outputs private key of .
- Enc()→C. This algorithm takes as input a public key and a message m. It outputs the ciphertext C for an identity .
- Dec()→m. This algorithm takes as input a private key and the ciphertext C. It outputs the message m.
4. SEDASGX Scheme
4.1. System Model
- The generates encrypted analysis queries and sends them to the . Additionally, the decrypts the query results.
- The encrypts genome data and uploads them to the . Meanwhile, the sends the update queries to the .
- The is divided into two parts: enclave and storage region. The enclave preprocesses all genomic data within the jurisdiction. After that, the enclave encrypts the processed data and sends them to the . The storage region mainly stores source data.
- The is divided into two parts: enclave and storage region. The enclave performs initialization operation, genome analysis operation, and update operation. The storage region mainly stores all encrypted data and oblivious storage structure.
- The is primarily responsible for remotely verifying the trusted execution environment of all edge servers and cloud services. Furthermore, the is also responsible for generating keys for each entity within the system and distributing them through secure channels.
4.2. Notations and Definitions
4.3. Constructions
- Second, the enclave creates the position map tables and stashes according to and . Meanwhile, the enclave randomly generates dummy genome data blocks and encrypts them according to . Then, the enclave writes them to the ORAM tree according to .
- Finally, the enclave computes by the and utilizes them to decrypt to obtain . The enclave re-encrypts to in the and writes them to according to updated .
- To generate a tailored analysis query q = , the k-th first selects the data from various ( and , 1 ≤ ≤ N), a certain gene locus , and a certain analysis method , -, or based on their analysis requirements.
- Subsequently, the employs their query key to generate an encrypted analysis query using the Enc algorithm, and sends to the .
Algorithm 1: Init Algorithm |
Input: ORAM structure key , encrypted data blocks , data key of , node capacity Z. Output: The position map , stashes , and encrypted ORAM trees , . Enclave: 1 Decrypt to obtain , , , ; 2 For do 3 Compute , , , , ; 4 Initialize an = of size and = of size ; 5 For do 6 Generate a dummy block , and ; 7 Copy , and to , , , , and set = 0; 8 IF 9 Encrypt blocks in and write them to by and empty ; 10 For do 11 Set and generate dummy blocks , 12 For do 13 Copy j, , , to , and set = ; 14 , , ,; 15 IF 16 Encrypt blocks in the and write them to by , and empty ; 17 Return ORAM tree , position map , and stashes ; |
- First, the enclave decrypts the to obtain , , , and .
- Second, the enclave acquires its corresponding ORAM tree and based on the values of and , and subsequently read the data from the ORAM tree and to the stash and by utilizing the position map table and , respectively.
- Finally, the enclave extracts the relevant information required for data analysis from the data blocks that have been read, and calls the corresponding analysis algorithm (e.g., ) to analyze the genome data according to the query request and obtain the corresponding analysis results R. After that, the enclave encrypts the R by using the corresponding query key and sends the to the k-th .
Algorithm 2: Analysis Algorithm |
Input: Encrypted ORAM tree , structure key , query key , encrypted analysis query . Output: Encrypted analysis result . 1 Encrypt the analysis query q to with query key ; 2 Decrypt to with , and read and to enclave; 3 For do 4 For do 5 Find and corresponding to ; 6 Read all blocks on in and decrypt them to ; 7 For 1≤≤ do 8 IF 9 Get , update , and , and re-encrypt ; 10 Repeat steps 3–10 to obtain of ; 11 Compute analysis result R according to of and of via Fisher’s exact algorithm; 12 Encrypt analysis result R to with and send it to the k-th ; 13 Decrypt encrypted analysis result and verify correctness; |
- Case 1: modify a single data block. First, the enclave decrypts the to obtain updated data, e.g., , , and . Then, the enclave finds in the to obtain and . After that, the enclave loads all data blocks on the in the and decrypts them into the . Next, the enclave searches the on the and replaces with the updated content . Meanwhile, it updates and copies it to . Finally, the enclave re-encrypts all data blocks in the and re-writes them back to the according to the updated .
- Case 2: add small data blocks. Upon receipt of the updated data blocks , uploaded by the , the enclave uses the to decrypt block by block. Then, the enclave generates the updated ORAM tree .
- Case 3: add massive data blocks. This is performed after receiving the encrypted data blocks , uploaded by . Because the actual number of genome data blocks in the storage region has exceeded the storage limit of the original ORAM trees, the enclave needs to call the Algorithm 1 to regenerate the new ORAM tree .
Algorithm 3: Update Algorithm |
Input: Encrypted ORAM tree , ORAM tree key , data key , encrypted update data blocks , the position map , the stash . Output: Updated ORAM trees , updated position map tables , updated stashes . 1 Preprocess update data to , and upload them to the ; 2 Decrypt to , and with , and find the ; 3 For do 4 IF 5 For 6 Find and corresponding to ; 7 Read all blocks on in and decrypt them to ; 8 For 1≤w≤ do 9 IF 10 Update , and ; 11 ELSE IF 12 For do 13 Update = , = , = ; 14 For do 15 For do 16 Copy ,, to , , and set =1; 17 Encrypt and write them to ORAM by , and empty ; 18 ELSE 19 Regenerate new , , and with the and updated data blocks ; 20 Return , , and |
5. Security Analysis
5.1. Correctness
5.2. Query Unlinkability
5.3. Access Pattern
5.4. CCA Security
6. Experiment Analysis
6.1. Implementation
6.2. Function Analysis
6.3. Performance Analysis
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Vellela, S.S.; Reddy, B.V.; Chaitanya, K.K.; Rao, M.V. An Integrated Approach to Improve E-Healthcare System using Dynamic Cloud Computing Platform. In Proceedings of the 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 23–25 January 2023; IEEE: Pisctaway, NJ, USA, 2023; pp. 776–782. [Google Scholar]
- Garrison, E.; Kronenberg, Z.N.; Dawson, E.T.; Pedersen, B.S.; Prins, P. A spectrum of free software tools for processing the VCF variant call format: Vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput. Biol. 2022, 18, e1009123. [Google Scholar] [CrossRef]
- Xu, Y.; Ren, J.; Wang, G.; Zhang, C.; Yang, J.; Zhang, Y. A blockchain-based nonrepudiation network computing service scheme for industrial IoT. IEEE Trans. Ind. Inform. 2019, 15, 3632–3641. [Google Scholar] [CrossRef]
- Gürsoy, G.; Li, T.; Liu, S.; Ni, E.; Brannon, C.M.; Gerstein, M.B. Functional genomics data: Privacy risk assessment and technological mitigation. Nat. Rev. Genet. 2022, 23, 245–258. [Google Scholar] [CrossRef] [PubMed]
- Sero, D.; Zaidi, A.; Li, J.; White, J.D.; Zarzar, T.B.G.; Marazita, M.L.; Weinberg, S.M.; Suetens, P.; Vandermeulen, D.; Wagner, J.K.; et al. Facial recognition from DNA using face-to-DNA classifiers. Nat. Commun. 2019, 10, 2557. [Google Scholar] [CrossRef]
- Kim, M.; Lauter, K. Private genome analysis through homomorphic encryption. BMC medical informatics and decision making. BioMed Cent. 2015, 15, 1–12. [Google Scholar]
- Sarkar, E.; Chielle, E.; Gürsoy, G.; Mazonka, O.; Gerstein, M.; Maniatakos, M. Fast and scalable private genotype imputation using machine learning and partially homomorphic encryption. IEEE Access 2021, 9, 93097–93110. [Google Scholar] [CrossRef]
- Wang, S.; Zhang, Y.; Dai, W.; Lauter, K.; Kim, M.; Tang, Y.; Xiong, H.; Jiang, X. HEALER: Homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics 2016, 32, 211–218. [Google Scholar] [CrossRef] [PubMed]
- Blatt, M.; Gusev, A.; Polyakov, Y.; Goldwasser, S. Secure large-scale genome-wide association studies using homomorphic encryption. Proc. Natl. Acad. Sci. USA 2020, 117, 11608–11613. [Google Scholar] [CrossRef] [PubMed]
- Kamm, L.; Bogdanov, D.; Laur, S.; Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 2013, 29, 886–893. [Google Scholar] [CrossRef]
- Dong, C.; Weng, J.; Liu, J.N.; Yang, A.; Liu, Z.; Yang, Y.; Ma, J. Maliciously secure and efficient large-scale genome-wide association study with multi-party computation. IEEE Trans. Dependable Secur. Comput. 2022, 20, 1243–1257. [Google Scholar] [CrossRef]
- Zhu, X.; Ayday, E.; Vitenberg, R. A privacy-preserving framework for conducting genome-wide association studies over outsourced patient data. IEEE Trans. Dependable Secur. Comput. 2022, 20, 2390–2405. [Google Scholar] [CrossRef]
- Chen, F.; Wang, C.; Dai, W.; Jiang, X.; Mohammed, N.; Al Aziz, M.M.; Sadat, M.N.; Sahinalp, C.; Lauter, K.; Wang, S. PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre guard extension. BMC Med. Genom. 2017, 10, 77–85. [Google Scholar] [CrossRef]
- Mandal, A.; Mitchell, J.C.; Montgomery, H.; Roy, A. Data oblivious genome variants search on Intel SGX. In Proceedings of the International Workshop on Data Privacy Management, Barcelona, Spain, 6–7 September 2018; Springer International Publishing: Cham, Swizterland, 2018; pp. 296–310. [Google Scholar]
- Kockan, C.; Zhu, K.; Dokmai, N.; Karpov, N.; Kulekci, M.O.; Woodruff, D.P.; Sahinalp, S.C. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat. Methods 2020, 17, 295–301. [Google Scholar] [CrossRef]
- Costan, V.; Devadas, S. Intel SGX explained. Cryptology ePrint Archive. Available online: https://eprint.iacr.org/2016/086 (accessed on 7 August 2022).
- Zheng, W.; Wu, Y.; Wu, X.; Feng, C.; Sui, Y.; Luo, X.; Zhou, Y. A survey of Intel SGX and its applications. Front. Comput. Sci. 2021, 15, 1–15. [Google Scholar] [CrossRef]
- Amjad, G.; Kamara, S.; Moataz, T. Forward and backward private searchable encryption with SGX. In Proceedings of the 12th European Workshop on Systems Security, Dresden, Germany, 25–28 March 2019; pp. 1–6. [Google Scholar]
- Jiang, Q.; Qi, Y.; Qi, S.; Zhao, W.; Lu, Y. Pbsx: A practical private boolean search using Intel SGX. Inf. Sci. 2020, 521, 174–194. [Google Scholar] [CrossRef]
- Will, N.C.; Maziero, C.A. Intel Software Guard Extensions Applications: A Survey. ACM Comput. Surv. 2023, 55, 322. [Google Scholar] [CrossRef]
- Djoko, J.B.; Lange, J.; Lee, A.J. Nexus: Practical and secure access control on untrusted storage platforms using client-side sgx. In Proceedings of the 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Portland, OR, USA, 24–27 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 401–413. [Google Scholar]
- Johnson, D.; Menezes, A.; Vanstone, S. The elliptic curve digital signature algorithm (ECDSA). Int. J. Inf. Secur. 2001, 1, 36–63. [Google Scholar] [CrossRef]
- Stefanov, E.; Dijk, M.V.; Shi, E.; Chan, T.H.H.; Fletcher, C.; Ren, L.; Yu, X.; Devadas, S. Path ORAM: An extremely simple oblivious RAM protocol. J. ACM (JACM) 2018, 65, 1–26. [Google Scholar] [CrossRef]
- LeMay, C. Privacy-Preserving Chi-Squared Tests Using Homomorphic Encryption. Available online: https://www.cs.utexas.edu/~dwu4/courses/sp22/static/projects/LeMay.pdf (accessed on 15 August 2023).
- Zhao, G.; Yang, H.; Yang, J.; Zhang, L.; Yang, X. A data-based adjustment for fisher exact test. Eur. J. Stat. 2021, 1, 74–107. [Google Scholar] [CrossRef]
- Ayers, K.L.; Cordell, H.J. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet. Epidemiol. 2010, 34, 879–891. [Google Scholar] [CrossRef]
- Naccache, D. Secure and practical identity-based encryption. IET Inf. Secur. 2007, 1, 59–64. [Google Scholar] [CrossRef]
- McGrew, D.A.; Viega, J. The security and performance of the Galois/Counter Mode (GCM) of operation. In Proceedings of the International Conference on Cryptology in India, Chennai, India, 20–22 December 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 343–355. [Google Scholar]
- Bard, G.V. Modes of Encryption Secure against Blockwise-Adaptive Chosen-Plaintext Attack. Cryptology ePrint Archive. Available online: https://eprint.iacr.org/2006/271, (accessed on 15 August 2023).
- GeneData Set. Available online: http://hgdownload-euro.soe.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ (accessed on 7 August 2023).
Notations | Descriptions |
---|---|
, , g, H | Security parameter, group, the generator group , and hash function |
ine , , , | Public key, master key, ORAM tree structure key, the data key of i-th |
ine , , | The query key of k-th , the data key of , the j-th data under the i-th |
ine , | The j-th initial vector under the i-th , the j-th tag in the i-th |
ine , | The j-th additional information under the i-th , preset block size |
ine , | The j-th ciphertext in the i-th , the -th data block under the i-th |
ine Z, , n | The node capacity of ORAM tree, genome locus, the i-th nation (i-th ) |
ine N, M, , | The number of , the number of , the path of ORAM tree, block identifier |
ine , , | The i-th ORAM tree, the i-th position map table, the i-th stash |
ine , | The sum number of data blocks under i-th , the size of |
ine , , | The high of ORAM tree, the size of , the size of |
ine , | The maximum path of , the state of data block (True:1 and false:0) |
Edge Server | Japan | Gambia | Britain | American | China |
---|---|---|---|---|---|
Genome Blocks | 383 | 414 | 4486 | 4715 | 9680 |
ORAM Tree Size | 3066 | 3066 | 49,146 | 49,146 | 98,298 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, B.; Zhou, F.; Wang, Q.; Feng, D. A Secure and Efficient Dynamic Analysis Scheme for Genome Data within SGX-Assisted Servers. Electronics 2023, 12, 5004. https://doi.org/10.3390/electronics12245004
Li B, Zhou F, Wang Q, Feng D. A Secure and Efficient Dynamic Analysis Scheme for Genome Data within SGX-Assisted Servers. Electronics. 2023; 12(24):5004. https://doi.org/10.3390/electronics12245004
Chicago/Turabian StyleLi, Bao, Fucai Zhou, Qiang Wang, and Da Feng. 2023. "A Secure and Efficient Dynamic Analysis Scheme for Genome Data within SGX-Assisted Servers" Electronics 12, no. 24: 5004. https://doi.org/10.3390/electronics12245004
APA StyleLi, B., Zhou, F., Wang, Q., & Feng, D. (2023). A Secure and Efficient Dynamic Analysis Scheme for Genome Data within SGX-Assisted Servers. Electronics, 12(24), 5004. https://doi.org/10.3390/electronics12245004