1. Introduction
It is assumed that an attacker has background knowledge of the names of n patients in a medical dataset concerning a certain disease. The attacker can query the sum of the disease status of all patients except the i-th patient and the sum over all n patients, and then infer whether the i-th patient has the disease by comparing the two query results. To mitigate the individual privacy leakage caused by this statistical inference attack, Dwork et al. [1] proposed differential privacy (DP), which protects individual privacy independently of the presence or absence of any individual. Since DP requires a trustworthy data collector in the centralized setting, it is called centralized DP; because it considers the global sensitivity of adjacent datasets, it is also known as global differential privacy (GDP). However, the data collector is often untrusted in real-world applications. Therefore, Kasiviswanathan et al. [2] proposed local differential privacy (LDP), which allows an untrusted third party to perform statistical analysis while preserving user privacy through random perturbation of local data. Both GDP and LDP exhibit privacy-utility monotonicity and can achieve a privacy-utility tradeoff [3]. GDP and LDP have become popular methods of data privacy preservation in the centralized and local settings, respectively; however, they have different advantages and disadvantages. In Table 1, we adopt Dobrota's [4] comparative analysis of the advantages and disadvantages of GDP and LDP.
Because of the respective advantages of GDP and LDP in the centralized and local settings, the data privacy community has widely studied both notions based on information theory. Current work addresses GDP and LDP from the following information-theoretic aspects: the privacy threat model, the channel models and definitions of GDP and LDP, their privacy-utility metrics, their properties, and the mechanisms satisfying them. Unless otherwise stated, the information-theoretic channel model in this survey refers to the discrete single-symbol information-theoretic channel. However, no review work has systematically surveyed the above existing work on GDP and LDP from the perspective of the information-theoretic channel.
Therefore, this paper systematically surveys GDP and LDP under the information-theoretic channel model in terms of the privacy threat model they resist, channel models, definitions, privacy-utility metrics, properties, and achieving mechanisms. Our main contributions are as follows.
(1) We summarized the privacy threat model under information-theoretic channel, and we provided a systematic survey on channel models, definitions, privacy-utility metrics, properties, and mechanisms of GDP and LDP from the perspective of information-theoretic channel.
(2) We presented a comparative analysis between GDP and LDP from the perspective of information-theoretic channel. Then, we concluded the common channel models, definitions, privacy-utility metrics, properties, and achieving mechanisms of GDP and LDP in the existing work.
(3) We surveyed applications of GDP and LDP in synthetic data generation. Specifically, we first presented the membership inference attack and model extraction attack against generative adversarial network (GAN). Then, we reviewed the differential privacy synthetic data generation with GAN and differential privacy synthetic data generation with federated learning, respectively.
(4) Through analyzing the advantages and disadvantages of the existing work for different application scenarios and data types, we also discussed future open problems of GDP and LDP based on different types of information-theoretic channel models.
This paper is organized as follows.
Section 2 introduces the preliminaries.
Section 3 summarizes the privacy threat model of the centralized and local data settings under the information-theoretic channel.
Section 4 describes the channel models of GDP and LDP and uniformly states and analyzes the definitions of GDP and LDP under their channel models.
Section 5 summarizes and compares the information-theoretic privacy-utility metrics of GDP and LDP.
Section 6 presents and analyzes the properties of GDP and LDP from the perspective of the information-theoretic channel.
Section 7 summarizes and analyzes the mechanisms of GDP and LDP from the perspective of the information-theoretic channel.
Section 8 discusses the open problems of GDP and LDP from the perspective of different types of information-theoretic channels for different application scenarios and data types.
Section 9 concludes this paper.
2. Preliminaries
In this section, we introduce the preliminaries of GDP [
1], LDP [
5], and the information-theoretic channel model and metrics [
5,
6,
7,
8,
9,
10,
11]. The commonly used mathematical symbols are summarized in
Table 2.
2.1. GDP and LDP
A dataset $x$ is a collection of records drawn from a universal set $X$, and each $x_i$ denotes the $i$-th record (or a subset of records) in the dataset $x$. When two datasets differ in only one record, they are adjacent datasets.
Definition 1 (GDP).
A randomized mechanism $\mathcal{M}$ with domain $X$ is $(\epsilon,\delta)$-DP if for all $S\subseteq \mathrm{Range}(\mathcal{M})$ and for any two adjacent datasets $x$ and $x'$, it holds that
$$\Pr[\mathcal{M}(x)\in S]\le e^{\epsilon}\Pr[\mathcal{M}(x')\in S]+\delta,$$
where the probability space is over the coin flips of the mechanism $\mathcal{M}$. If $\delta=0$, then $\mathcal{M}$ is $\epsilon$-DP. The coin flips of the mechanism refer to the internal randomness of the mechanism with respect to each record of each individual: the probability distribution of the response to any query is essentially the same regardless of whether any individual is present in or absent from the dataset. If $\mathcal{M}$ is $(\epsilon,\delta)$-DP, then $\mathcal{M}$ is $\epsilon$-DP with probability at least $1-\delta$ for all datasets $x$ and $x'$ when $x$ and $x'$ are adjacent datasets. For the definition of LDP, the coin flips of the mechanism have the same meaning.
Definition 2 (LDP).
A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon,\delta)$-LDP if and only if for any pair of input values $x$ and $x'$ in the domain of $X$, and for any possible output $y$, it holds that
$$\Pr[\mathcal{M}(x)=y]\le e^{\epsilon}\Pr[\mathcal{M}(x')=y]+\delta,$$
where the probability space is over the coin flips of the mechanism $\mathcal{M}$. If $\delta=0$, then $\mathcal{M}$ is $\epsilon$-LDP.
2.2. Information-Theoretic Channel and Metrics
The mathematical model of an information-theoretic channel can be denoted by $\{X, p(y|x), Y\}$, where
(1) $X$ is the input random variable, and its value set is the input alphabet.
(2) $Y$ is the output random variable, and its value set is the output alphabet.
(3) $p(y|x)$ is the channel transition probability matrix, and the probabilities in each row sum to one, i.e., $\sum_{y} p(y|x)=1$.
In the information-theoretic channel model, the Rényi divergence of order $\alpha$ of a probability distribution $p$ on the source $X$ from another distribution $q$ is $D_{\alpha}(p\|q)=\frac{1}{\alpha-1}\log\sum_{x}p(x)^{\alpha}q(x)^{1-\alpha}$, where $\alpha>0$ and $\alpha\neq 1$. When $q$ is the uniform distribution with $q(x)=1/|X|$, the Rényi entropy is $H_{\alpha}(X)=\log|X|-D_{\alpha}(p\|q)$ in terms of the Rényi divergence of $p$ from $q$. When $\alpha\to 1$, the Rényi entropy tends to the Shannon entropy $H(X)=-\sum_{x}p(x)\log p(x)$ of the source $X$. When $\alpha\to\infty$, the Rényi entropy tends to the min-entropy $H_{\infty}(X)=-\log\max_{x}p(x)$. The conditional Rényi entropy of $X$ given $Y$ is denoted by $H_{\alpha}(X|Y)$. When $\alpha\to 1$, the conditional Rényi entropy is the conditional Shannon entropy $H(X|Y)$. When $\alpha\to\infty$, the conditional Rényi entropy is the conditional min-entropy $H_{\infty}(X|Y)=-\log\sum_{y}p(y)\max_{x}p(x|y)$. The mutual information $I(X;Y)=H(X)-H(X|Y)$ is the average amount of information about $X$ contained in the random variable $Y$. Furthermore, the max-information is $I_{\infty}(X;Y)=\log\max_{x,y}\frac{\Pr[X=x,Y=y]}{\Pr[X=x]\Pr[Y=y]}$, and the $\beta$-approximate max-information $I_{\infty}^{\beta}(X;Y)$ relaxes the max-information by allowing a slack probability of at most $\beta$.
Moreover, when $\alpha\to 1$, the Rényi divergence becomes the Kullback–Leibler (KL) divergence $D_{KL}(p\|q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$. The KL-divergence is an instance of the family of f-divergences $D_{f}(p\|q)=\sum_{x}q(x)f\!\left(\frac{p(x)}{q(x)}\right)$ defined by non-negative convex functions $f$, with $f(t)=t\log t$. The total variation distance is also an instance of the family of f-divergences with $f(t)=\frac{1}{2}|t-1|$, and the total variation distance between distributions $p$ and $q$ is $TV(p,q)=\frac{1}{2}\sum_{x}|p(x)-q(x)|$. When $\alpha\to\infty$, the Rényi divergence becomes the max-divergence $D_{\infty}(p\|q)=\log\max_{x}\frac{p(x)}{q(x)}$, and the $\delta$-approximate max-divergence is $D_{\infty}^{\delta}(p\|q)=\log\max_{S:\Pr_{p}[S]\geq\delta}\frac{\Pr_{p}[S]-\delta}{\Pr_{q}[S]}$.
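These divergence and entropy quantities can be computed directly for small discrete distributions. The following sketch (an illustration with arbitrary example distributions, not part of the original text) evaluates the Rényi divergence, Shannon entropy, min-entropy, KL-divergence, total variation distance, and max-divergence.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha (alpha > 0, alpha != 1) between discrete distributions."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def shannon_entropy(p):
    return -np.sum(p * np.log(p))

def min_entropy(p):
    return -np.log(np.max(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

def total_variation(p, q):
    return 0.5 * np.sum(np.abs(p - q))

def max_divergence(p, q):
    return np.log(np.max(p / q))

p = np.array([0.7, 0.2, 0.1])   # example distribution (assumption)
q = np.array([0.4, 0.4, 0.2])   # example reference distribution (assumption)

print(renyi_divergence(p, q, alpha=2.0))
print(shannon_entropy(p), min_entropy(p))
print(kl_divergence(p, q), total_variation(p, q), max_divergence(p, q))
```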
The expected distortion between the input random variable $X$ and the output random variable $Y$ is
$$E[d(X,Y)]=\sum_{x}\sum_{y}p(x)p(y|x)d(x,y),$$
where the distance measurement $d(x,y)$ is the single-symbol distortion. The average error probability is
$$P_{e}=\Pr[Y\neq X]=\sum_{x}\sum_{y\neq x}p(x)p(y|x).$$
Thus, the average error probability is the expected Hamming distortion when $d(x,y)$ is the Hamming distortion in Equation (3).
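As a small illustration (the input distribution and channel below are assumptions, not from the original text), the expected Hamming distortion of a channel coincides with its average error probability:

```python
import numpy as np

def expected_distortion(p_x, channel, d):
    """E[d(X, Y)] = sum_x sum_y p(x) p(y|x) d(x, y) for a discrete single-symbol channel."""
    return sum(p_x[i] * channel[i, j] * d(i, j)
               for i in range(len(p_x)) for j in range(channel.shape[1]))

hamming = lambda i, j: 0.0 if i == j else 1.0

p_x = np.array([0.5, 0.5])                     # assumed uniform binary source
channel = np.array([[0.8, 0.2],                # assumed transition matrix p(y|x)
                    [0.2, 0.8]])

avg_error = expected_distortion(p_x, channel, hamming)
print(avg_error)   # 0.2: the average error probability equals the expected Hamming distortion
```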
3. Privacy Threat Model on Information-Theoretic Channel
To mitigate the statistical inference attack, GDP makes a strong adversary assumption in which an adversary knows $n-1$ of the $n$ dataset records and tries to identify the remaining one [12,13]. However, the adversary is usually computationally bounded. Thus, Mironov [11] and Mir [14] assumed that the adversary has prior knowledge over the set of possible input datasets $X$. Furthermore, Smith [15] proposed the one-try attack, where an adversary is allowed to ask exactly one question of the form "is $X = x$?". The Rényi min-entropy of $X$ characterizes the probability of success of the one-try attack under the best strategy, which chooses the $x$ with maximum probability. The conditional Rényi min-entropy of $X$ given $Y$ captures the probability of guessing the value of $X$ in a single try when the output $Y$ is known. Therefore, the privacy leakage of the channel model under the one-try attack is the Rényi min-entropy leakage, i.e., the difference between the min-entropy of $X$ and the conditional min-entropy of $X$ given $Y$ [7]. The Rényi min-entropy leakage is a max-information quantity: it is the logarithm of the ratio of the probabilities of attack success under the a posteriori and the a priori distributions. Thus, Rényi min-entropy leakage corresponds to the concept of Bayes risk and can also be regarded as a measure of the effectiveness of the attack. The maximal leakage is the maximal reduction in uncertainty about $X$ when $Y$ is observed [16]; it is obtained by maximizing the min-entropy leakage over all input distributions.
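As an illustrative sketch (the prior and channel below are assumptions, not from the original text), min-entropy leakage can be computed as the difference between the prior min-entropy and the conditional min-entropy of X given Y, and the channel's maximal leakage can be computed by summing the column-wise maxima of the transition matrix:

```python
import numpy as np

def min_entropy_leakage(p_x, channel):
    """Min-entropy leakage H_inf(X) - H_inf(X|Y) for a discrete channel p(y|x), in bits."""
    prior_success = np.max(p_x)                        # best one-try guess a priori
    joint = p_x[:, None] * channel                     # p(x, y)
    posterior_success = np.sum(np.max(joint, axis=0))  # sum_y max_x p(x, y)
    return np.log2(posterior_success) - np.log2(prior_success)

def maximal_leakage(channel):
    """Maximal leakage of the channel, maximized over all input distributions."""
    return np.log2(np.sum(np.max(channel, axis=0)))    # log sum_y max_x p(y|x)

p_x = np.array([0.5, 0.5])                  # assumed uniform prior on the secret
channel = np.array([[0.75, 0.25],           # assumed transition matrix p(y|x)
                    [0.25, 0.75]])

print(min_entropy_leakage(p_x, channel))
print(maximal_leakage(channel))
```

For this symmetric channel the uniform prior already attains the maximum, so the two printed values coincide.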
When the adversary possesses knowledge of the a priori probability distribution of the input, LDP can still lead to a risk of privacy leakage [2,17,18,19,20,21,22]. However, a better privacy-utility tradeoff can be achieved by incorporating the attacker's knowledge into the LDP model; that is, data utility can be improved by explicitly modeling the adversary's prior knowledge in LDP.
To sum up, the privacy threat under the information-theoretic channel refers to the Bayes risk on the input X when the attacker knows the output Y. Thus, GDP and LDP can be used to mitigate this privacy threat on the information-theoretic channel for numerical data and categorical data, respectively.
5. Privacy-Utility Metrics of GDP and LDP under Information-Theoretic Channel Models
In
Table 7, we summarize and analyze the information-theoretic privacy metrics of GDP. When $\alpha > 1$, the Rényi divergence of order $\alpha$ is used as the privacy metric of GDP, yielding Rényi DP (RDP), a natural relaxation of GDP based on the Rényi divergence [11]. Chaudhuri et al. [25] used restricted divergences as the privacy metric: when the restricted divergence is the Rényi divergence, capacity bounded DP is a generalization of RDP; when it is an f-divergence, capacity bounded DP corresponds to the f-divergence-based DP of [8]. In [14,29], mutual information is used as the privacy metric of GDP, which is the amount of information leaked about X after observing Z. Cuff and Yu [13] also used α-mutual-information as the privacy metric of GDP, which generalizes mutual information using the Rényi divergence of order α. Alvim et al. [7] used min-entropy leakage as the privacy metric of GDP, which is the ratio of the probabilities of guessing correctly a posteriori and a priori. Furthermore, the maximal leakage of the channel is used as the privacy metric of GDP, which is the maximal reduction in uncertainty about X when Z is given [7,16]. According to graph symmetrization, Edwards et al. [30] also regarded min-entropy leakage as an important measure of the differential privacy loss of information channels under Blowfish privacy, which is a generalization of global differential privacy. Rogers et al. [31] defined the privacy metric of GDP using max-information and β-approximate max-information, which are correlation measures that bound the change in the conditional probability of an event relative to its prior probability. In [32,33], the privacy budget ε is directly used as the privacy metric. Therefore, we can conclude that the Rényi divergence is a more general privacy metric of GDP, since it generalizes restricted divergences and can deduce the f-divergence, min-entropy leakage, maximal leakage, and max-information. Moreover, mutual information can also be used as a privacy metric of GDP.
We also summarize and analyze the information-theoretic utility metrics of GDP in Table 8. In the information-theoretic channel model of GDP, expected distortion is the main utility measure, which shows how much information about the real answer can be obtained from the reported answer on average [7,33]. Padakandla et al. [32] used fidelity as the utility metric, where the fidelity between transition probability distributions is measured by a distortion metric. Mutual information is used not only as a privacy metric but also as a utility metric of GDP, capturing the amount of information shared by two variables [33].
In
Table 9, we summarize and analyze existing work on information-theoretic privacy metrics of LDP. In the information-theoretic channel model of LDP, Duchi et al. [17] defined the privacy metric of LDP using the KL-divergence; they bound the KL-divergence between the output distributions induced by an ε-LDP mechanism by a quantity dependent on the privacy budget ε and the total variation distance between the corresponding initial input distributions. Of course, mutual information can also be used as a privacy measure of LDP [34,35]. More generally, the existing work mainly uses the definition of LDP itself as the privacy metric [5,36,37,38]. In [39], Lopuhaä-Zwakenberg et al. gave an average privacy metric based on the ratio of conditional entropies of the sensitive information X.
Next, we summarize and analyze the information-theoretic utility metric of LDP in
Table 10. In the information-theoretic channel model of LDP,
f-divergence [
5] and mutual information [
5,
36,
38] can also be used as utility measures of LDP. In most cases, expected distortion is used as the utility measure of LDP [
20,
34,
35,
36,
37]. In [
39], Lopuhaä-Zwakenberg et al. presented distribution utility and tally utility metrics based on the ratio of relevant information.
6. Properties of GDP and LDP under Information-Theoretic Channel Models
In
Table 11, we present and analyze the properties of GDP based on the information-theoretic channel model. According to the Rényi divergence, Mironov [11] demonstrated that the new definition (RDP) shares many important properties with the standard definition of GDP, including post-processing, group privacy, and sequential composition. Considering restricted divergences, including the Rényi divergence, Chaudhuri et al. [25] showed that capacity bounded DP has the properties of post-processing, convexity, sequential composition, and parallel composition. Barthe and Köpf [16] proved the sequential composition and parallel composition of GDP based on maximal leakage under the information-theoretic channel model. Barthe and Olmedo [8] also proved the parallel composition of GDP using the f-divergence. We know that the Rényi divergence can deduce maximal leakage and max-divergence, and the f-divergence of Reference [8] is in fact the max-divergence. Thus, we can conclude that the properties of GDP, such as post-processing, convexity, group privacy, sequential composition, and parallel composition, can be proved using the Rényi divergence.
Similarly, GDP and LDP share the above properties under the information-theoretic channel model. Therefore, LDP also has the properties of post-processing, convexity, group privacy, sequential composition, and parallel composition.
Moreover, we have shown that GDP and LDP have privacy-utility monotonicity [3]. In GDP, ε-DP shows that $\Pr[\mathcal{M}(x)\in S]\le e^{\epsilon}\Pr[\mathcal{M}(x')\in S]$ for adjacent datasets $x$ and $x'$. We can obtain $e^{-\epsilon}\le \Pr[\mathcal{M}(x)\in S]/\Pr[\mathcal{M}(x')\in S]\le e^{\epsilon}$. When $\epsilon\to 0$, we have $\Pr[\mathcal{M}(x)\in S]\to\Pr[\mathcal{M}(x')\in S]$, so the output distribution becomes independent of the input dataset. Using mutual information as the utility metric, we can conclude that the mutual information of GDP decreases as the privacy budget decreases, and vice versa: privacy preservation becomes stronger while utility becomes worse, and vice versa. Thus, GDP has privacy-utility monotonicity, indicating the privacy-utility tradeoff. Similarly, we can observe that LDP also has privacy-utility monotonicity, indicating the privacy-utility tradeoff.
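This monotonicity can be checked numerically. The following sketch (illustrative only; the binary randomized response channel with a uniform input is an assumption) computes the mutual information of an ε-LDP randomized response channel for several privacy budgets, showing that smaller ε yields smaller mutual information (stronger privacy, worse utility):

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits for a discrete channel p(y|x) with input distribution p_x."""
    joint = p_x[:, None] * channel
    p_y = joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (p_x[:, None] * p_y[None, :])[mask]))

def randomized_response(epsilon):
    """Binary randomized response: report the true bit with probability e^eps / (1 + e^eps)."""
    p = np.exp(epsilon) / (1 + np.exp(epsilon))
    return np.array([[p, 1 - p],
                     [1 - p, p]])

p_x = np.array([0.5, 0.5])   # assumed uniform source
for eps in [0.1, 0.5, 1.0, 2.0, 4.0]:
    print(eps, mutual_information(p_x, randomized_response(eps)))
```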
7. GDP and LDP Mechanisms under Information-Theoretic Channel Models
In
Table 12, we summarize and compare the GDP mechanisms from the perspective of information-theoretic channel on uniform distribution of the source
X. Alvim et al. [
7] maximized expected distortion under a min-entropy leakage constraint and obtained the optimal randomization mechanism using the graph symmetry induced by the adjacency relationship between adjacent datasets. The optimal randomization mechanism can ensure better utility while achieving ε-DP. According to the risk-distortion framework, Mir [14] minimized mutual information under an expected distortion constraint and obtained an ε-DP mechanism of exponential form via the Lagrange multiplier method, in which the exponential weight is divided by a normalization function. The GDP mechanism of [14] corresponds to the exponential mechanism [40]; the resulting conditional probability distribution minimizes the privacy leakage risk given a distortion constraint. Ayed et al. [33] maximized mutual information under a DP constraint and solved the constrained maximization program to obtain the DP mapping under a strongly symmetric channel.
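A minimal sketch of the exponential-form solution that this line of work arrives at (the score function, output set, and parameters below are illustrative assumptions, not the exact construction of [14]): the mechanism samples an output with probability proportional to an exponentially weighted score, normalized per input.

```python
import numpy as np

def exponential_mechanism(x, outputs, score, sensitivity, epsilon, rng=None):
    """Sample y with probability proportional to exp(eps * score(x, y) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    weights = np.array([np.exp(epsilon * score(x, y) / (2 * sensitivity)) for y in outputs])
    probs = weights / weights.sum()          # per-input normalization function
    return rng.choice(outputs, p=probs)

# Illustrative assumption: outputs are candidate values and the score is the negative distance.
outputs = np.arange(0, 11)
score = lambda x, y: -abs(x - y)             # higher score = closer to the true value
print(exponential_mechanism(x=4, outputs=outputs, score=score, sensitivity=1.0, epsilon=1.0))
```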
In addition, Mironov [11] analyzed the RDP of three basic mechanisms and their self-composition, namely randomized response, the Laplace mechanism, and the Gaussian mechanism, and gave the RDP parameters of these mechanisms. Considering linear and unrestricted adversaries, Chaudhuri et al. [25] also discussed the capacity bounded DP properties of the Laplace mechanism, the Gaussian mechanism, and the matrix mechanism and presented bounds on the privacy budget ε of the Laplace and Gaussian mechanisms under the KL-divergence and the Rényi divergence, respectively.
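For instance, Mironov [11] shows that the Gaussian mechanism with ℓ2-sensitivity Δ and noise scale σ satisfies (α, αΔ²/(2σ²))-RDP, that RDP composes additively over invocations, and that (α, ε)-RDP implies (ε + log(1/δ)/(α−1), δ)-DP. The short sketch below (parameter values are illustrative assumptions) tracks the RDP of repeated Gaussian perturbations and converts it to an (ε, δ) guarantee.

```python
import numpy as np

def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP parameter of one Gaussian mechanism invocation: alpha * Delta^2 / (2 * sigma^2)."""
    return alpha * sensitivity**2 / (2 * sigma**2)

def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert an (alpha, rdp_eps)-RDP guarantee to (eps, delta)-DP."""
    return rdp_eps + np.log(1 / delta) / (alpha - 1)

sigma, sensitivity, steps, delta = 4.0, 1.0, 100, 1e-5   # illustrative values
best_eps = min(rdp_to_dp(a, steps * gaussian_rdp(a, sensitivity, sigma), delta)
               for a in np.arange(1.5, 64.0, 0.5))        # optimize over Rényi orders
print(best_eps)
```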
In
Table 13, we summarize and compare the LDP mechanisms from the perspective of information-theoretic channel under uniform distribution of the source
X. According to the rate-distortion function, References [34,35,37] maximized mutual information under an expected Hamming distortion constraint D and obtained the corresponding privacy budget ε as a function of D for the binary channel and for discrete alphabets. Kairouz et al. [5] maximized KL-divergence and mutual information under an LDP constraint and obtained the binary randomized response mechanism, the multivariate randomized response mechanism, and the quaternary randomized response mechanism by solving the privacy-utility maximization problem, which is equivalent to solving a finite-dimensional linear program. Although Ayed et al. [33] maximized mutual information under a GDP constraint, they also obtained the binary randomized response mechanism and the multivariate randomized response mechanism under a strongly symmetric channel. Wang et al. [38] maximized mutual information under an LDP constraint and obtained the k-subset mechanism with respect to the uniform distribution on the source X. When k = 1, the 1-subset mechanism is the multivariate randomized response mechanism; when k = 1 and |X| = 2, the multivariate randomized response mechanism reduces to the binary randomized response mechanism. Xiong et al. [36] minimized the privacy budget ε under an expected distortion constraint, which is equivalent to solving a standard generalized linear-fractional program via the bisection method. However, Xiong et al. [36] did not give a specific expression of the optimal privacy channel.
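A minimal sketch of the multivariate (k-ary) randomized response channel discussed above (the alphabet size and ε are illustrative assumptions): it keeps the true value with probability e^ε/(e^ε + k − 1), otherwise reports one of the remaining values uniformly, and the code checks the ε-LDP ratio bound on the transition matrix.

```python
import numpy as np

def k_randomized_response(k, epsilon):
    """Transition matrix of k-ary randomized response satisfying epsilon-LDP."""
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    p_flip = 1.0 / (np.exp(epsilon) + k - 1)
    return np.full((k, k), p_flip) + np.eye(k) * (p_keep - p_flip)

k, epsilon = 4, 1.0                       # illustrative alphabet size and budget
channel = k_randomized_response(k, epsilon)

# epsilon-LDP check: max over outputs y and input pairs (x, x') of p(y|x) / p(y|x') <= e^eps.
worst_ratio = max(channel[x, y] / channel[xp, y]
                  for y in range(k) for x in range(k) for xp in range(k))
print(np.log(worst_ratio) <= epsilon + 1e-12)   # True
```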
Furthermore, Duchi et al. [41] showed that randomized response is an optimal way to perform survey sampling while maintaining the privacy of the respondents. Holohan et al. [42] proposed the following optimal randomized response mechanism satisfying (ε, δ)-DP under a uniform distribution of the source X: the true value is reported with the maximum probability permitted by the (ε, δ)-DP constraint, namely $\frac{e^{\epsilon}+\delta}{1+e^{\epsilon}}$, and flipped otherwise.
Erlingsson et al. [43] proposed the randomized aggregatable privacy-preserving ordinal response (RAPPOR) by applying randomized response in a novel manner. RAPPOR provides its privacy guarantee using a permanent randomized response and an instantaneous randomized response and ensures high-utility analysis of the collected data. RAPPOR encodes each value $v$ into a length-$k$ binary bit vector $B$. For the permanent randomized response, RAPPOR generates a noisy vector $B'$ whose $i$-th bit is
$$B'_i=\begin{cases}1, & \text{with probability } \frac{1}{2}f,\\ 0, & \text{with probability } \frac{1}{2}f,\\ B_i, & \text{with probability } 1-f,\end{cases}$$
where $f\in[0,1]$ is the permanent randomization parameter. With respect to the instantaneous randomized response, RAPPOR perturbs $B'$ into the reported vector $S$, whose $i$-th bit is set to 1 with probability $q$ if $B'_i=1$ and with probability $p$ if $B'_i=0$.
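A minimal sketch of the two-level randomized response of RAPPOR described above (the parameter values f, p, q and the one-hot encoding are illustrative assumptions for the basic variant):

```python
import numpy as np

def permanent_rr(bits, f, rng):
    """Permanent randomized response: each bit becomes 1 w.p. f/2, 0 w.p. f/2, unchanged w.p. 1-f."""
    r = rng.random(bits.shape)
    return np.where(r < f / 2, 1, np.where(r < f, 0, bits))

def instantaneous_rr(noisy_bits, p, q, rng):
    """Instantaneous randomized response: report 1 w.p. q if B'_i = 1, w.p. p if B'_i = 0."""
    probs = np.where(noisy_bits == 1, q, p)
    return (rng.random(noisy_bits.shape) < probs).astype(int)

rng = np.random.default_rng(0)
k, value = 8, 3                            # illustrative alphabet size and true value
B = np.zeros(k, dtype=int); B[value] = 1   # one-hot encoding of the value

B_prime = permanent_rr(B, f=0.5, rng=rng)                 # memoized per value
report = instantaneous_rr(B_prime, p=0.35, q=0.65, rng=rng)
print(B, B_prime, report)
```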
8. Differential Privacy Synthetic Data Generation
Data sharing facilitates training better models, decision making, and the reproducibility of scientific research. However, if data are shared directly, they face the risk of privacy leakage and the problem of small training sample sizes. Thus, synthetic data are often used to replace the sharing of real data. At present, one of the main methods for synthetic data generation is the generative adversarial network (GAN) [44]. A GAN consists of two neural networks: a generator and a discriminator. The generator produces realistic samples from input noise drawn from a multivariate Gaussian or uniform distribution. The discriminator is a binary classifier (such as a 0-1 classifier) that judges whether an input sample is real or fake, i.e., whether it comes from the real sample set or the generated sample set. The generator, in turn, tries to make its samples so realistic that the discriminator cannot tell them apart from real ones. Through this adversarial process, a GAN can generate synthetic data that approximate the real data. Because the synthetic data accurately reflect the distribution of the training data, they can avert privacy leakage by replacing real data sharing, augment small-scale training data, and be generated on demand. Thus, GANs can generate synthetic data for time-series, continuous, and discrete data [45].
However, because the discriminator easily memorizes the training data, it brings the risk of privacy leakage [46]. Therefore, GANs mainly face the privacy threats of membership inference attack and model extraction attack, summarized in Table 14. Hayes et al. [47] proposed a membership inference attack against generative models, in which the attacker can determine whether a given data point was used to train the model. Liu et al. [48] proposed a new membership inference attack, the co-membership inference attack, which checks whether the given n instances were all used in training or none of them were. Hilprecht et al. [49] proposed a Monte Carlo attack for membership inference against generative models, which yields high membership inference accuracy. Chen et al. [50] systematically analyzed the potential risk of privacy leakage caused by generative models and proposed a taxonomy of membership inference attacks, including not only the existing attacks but also a proposed generic attack model based on reconstruction. Hu and Pang [51] studied the model extraction attack against GANs, whose purpose is to copy the target machine learning model through query access to it. In order to mitigate the model extraction attack, Hu and Pang designed defenses based on input and output perturbation by perturbing the latent code and the generated samples, respectively.
However, the existing work mainly achieves model protection for neural networks based on differential privacy. By clipping each gradient using its $\ell_2$ norm and a clipping threshold, and then randomly perturbing the clipped gradient with the Gaussian mechanism, Abadi et al. [52] proposed differentially private stochastic gradient descent (DP-SGD) to protect the privacy of training data during the training process, and demonstrated a moments accountant of the privacy loss that provides a tighter bound than the generic strong composition theorem of differential privacy [9].
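A minimal sketch of one DP-SGD update step as described above (per-example gradients, a clipping threshold, and a noise multiplier are assumed for illustration; this is not the full algorithm of [52], which also includes subsampling and the moments accountant):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng=None):
    """One DP-SGD step: clip each per-example gradient in L2 norm, sum, add Gaussian noise, average."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))   # per-example gradient clipping
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean

params = np.zeros(3)                                              # illustrative model parameters
grads = [np.array([0.4, -1.2, 2.5]), np.array([3.0, 0.1, -0.7])]  # assumed per-example gradients
params = dp_sgd_step(params, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1)
print(params)
```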
Next, in
Table 15 and
Table 16, we mainly review the work of synthetic data generation based on differential privacy GAN and differential privacy GAN with federated learning from the following aspects: gradient perturbation, weight perturbation, data perturbation, label perturbation, and objective function perturbation. Thus, our work is different from the existing surveys [
53,
54].
8.1. Differential Privacy Synthetic Data Generation with Generative Adversarial Network
Because the discriminator of a GAN can easily memorize the training samples, training a GAN on sensitive or private data breaches the privacy of the training data. Thus, gradient perturbation can protect the privacy of the sensitive training data by training GAN models with differential privacy based on DP-SGD. Existing work protects the privacy of the training dataset by adding carefully designed noise to the clipped gradients during the learning procedure of the discriminator and uses the moments accountant or the RDP accountant to better keep track of the privacy cost and improve the quality of the synthetic data. The RDP accountant [11] provides a tighter bound on the privacy loss than the moments accountant. In gradient perturbation, the clipping strategy and the perturbation strategy improve the performance of the model while preserving the privacy of the training dataset.
Using gradient perturbation, Lu and Yu [
55] proposed a unified GAN-based framework for publishing differentially private data, such as tabular data and graphs, generating synthetic data with acceptable utility in a differentially private manner. Xie et al. [
56] proposed a differential privacy Wasserstein GAN (WGAN) [
57] model, which adds carefully designed noise to the clipped gradient in the learning process, generates high-quality data points at a reasonable privacy level, and uses the moments accountant to ensure privacy in the iterative gradient descent process. Frigerio et al. [
45] developed a differential privacy framework for privacy protection data publishing using GAN, which can easily adapt to the generation of continuous, time series, and discrete data and maintain the original distribution of features and the correlation between them at a good level of privacy. Torkzadehmahani et al. [
58] introduced a differential privacy condition GAN (CGAN) [
59] training framework based on a clipping and perturbation strategy, which generates synthetic data and corresponding labels while preserving the privacy of training datasets and uses the RDP accountant to track the spent privacy budget. Liu et al. [
60] proposed a GAN model for privacy protection, which achieves differential privacy by adding carefully designed noise to the clipped gradient during model learning, uses the moments accountant strategy to improve the stability and compatibility of the model by controlling the privacy loss, and generates high-quality synthetic data while retaining the required available data under a reasonable privacy budget. Ha and Dang [
61] proposed a local differential privacy GAN model for noise data generation, which establishes a generative model by clipping the gradient in the model and adding Gaussian noise to the gradient to ensure the differential privacy. Chen et al. [
62] proposed the gradient-sanitized WGAN, which allows the publication of sanitized sensitive data under a strict privacy guarantee and can more precisely distort gradient information so as to train deeper models and generate more informative samples. Yang et al. [
63] proposed a differential privacy gradient penalty WGAN (WGAN-GP) [
64] to train a generative model with privacy protection function, which can provide strong privacy protection for sensitive data and generate high-quality synthetic data. Beaulieu-Jones et al. [
65] used the auxiliary classifier GAN (AC-GAN) [
66] with different privacy to generate simulated synthetic participants very similar to Systolic Blood Pressure Trial participants, which can generate synthetic participants and promote secondary analysis and repeatability investigation of clinical datasets by strengthening data sharing and protecting participants’ privacy. Fan and Pokkunuru [
67] proposed a differential privacy solution for generating high-quality synthetic network flow data, which uses new clipping bound decay and privacy model selection to improve the quality of synthetic data and protects the privacy of sensitive training data by training GAN model with differential privacy. Zhang et al. [
68] proposed a privacy publishing model based on GAN for graphs (NetGAN) [
69], which can maintain high data utility in terms of degree distribution and satisfies ε-differential privacy.
Data perturbation achieves privacy preservation by adding differential privacy noise to the training data when using GANs to generate synthetic data. Li et al. [
70] proposed a graph data privacy protection method using GAN to perform an anonymization operation on graph data, which makes it possible to fully learn the characteristics of graph without specifying specific features and ensures the privacy performance of anonymous graph by adding differential privacy noise to the probability adjacency matrix in the process of graph generation. Neunhoeffer et al. [
71] proposed differentially private post-GAN boosting, which combines the samples produced by the sequence of generators obtained during GAN training to create a high-quality synthetic dataset and reweights the generated samples using the private multiplicative weights method [
72]. Indhumathi and Devi [
73] proposed healthcare Cramér GAN, which adds differential privacy noise only to the identified quasi-identifiers and combines the final result with the sensitive attributes. The anonymized medical data are used as the real data for training the Cramér GAN, the Cramér distance is used to improve the efficiency of the model, and the synthetic data generated by the healthcare GAN provide high privacy and resist various attacks. Imtiaz et al. [74] proposed a GAN combined with a differential privacy mechanism to generate a realistic, private smart healthcare dataset by directly adding noise to the aggregated data records, which can generate high-quality synthetic and differentially private datasets while retaining the statistical characteristics of the original dataset.
By using label perturbation with differential privacy noise, Papernot et al. [78] constructed the private aggregation of teacher ensembles (PATE), which provides a strong privacy guarantee for training data. The mechanism combines multiple models trained on disjoint datasets in a black-box way. Because these models rely directly on sensitive data, they are not published but are used as "teachers" of a "student" model. Since Laplace noise is added only to the teachers' outputs, the student learns to predict the output chosen by noisy voting among all teachers and cannot directly access an individual teacher, the underlying data, or the parameters. PATE uses the moments accountant to better track privacy costs. Building on the GAN and PATE frameworks, Jordon et al. [75] replaced the GAN discriminator with the PATE mechanism; the discriminator therefore satisfies differential privacy, and a differentiable student version is needed to allow backpropagation to the generator. However, this mechanism requires the use of public data.
In objective function perturbation, existing work injects Laplace noise into the coefficients to construct a differentially private loss function for GAN training. Zhang et al. [76] proposed a new privacy-preserving GAN, which perturbs the coefficients of the objective function by injecting Laplace noise into the latent space based on the functional mechanism to ensure the differential privacy of the training data; it reliably generates high-quality, realistic synthetic data samples without divulging sensitive information in the training dataset.
In addition, current research mainly focuses on publishing privacy-preserving data in a statistical way rather than considering the dynamics and correlation of the context. Thus, on the basis of triple-GAN [79], Ho et al. [77] proposed a three-player generative adversarial game framework, which designs a new perceptron, namely a differential privacy identifier, to enhance synthetic data in a differentially private way. This deep generative model can generate synthetic data while fulfilling the differential privacy constraint.
8.2. Differential Privacy Synthetic Data Generation with Federated Learning
In order to achieve distributed collaborative data analysis, collecting large-scale data is an important task. However, due to the sensitivity of private data, it is difficult to collect enough samples. GANs can generate synthetic data that can be shared for data analysis, but in the distributed setting, training a GAN faces new data privacy challenges. Therefore, existing work provides solutions for differentially private synthetic data generation by combining GANs and federated learning in a distributed setting. Federated learning coordinates clients holding independent and identically distributed (IID) or non-IID data to perform collaborative learning according to the FedAvg training algorithm of model aggregation and averaging [80].
Similar to the idea of gradient perturbation, weight perturbation can achieve differential privacy of a generative model by clipping the weights and adding noise to them during GAN training with federated learning. The machine learning modeler's workflow relies on data inspection, which is precluded in the private and decentralized data paradigm where direct inspection is impossible. In order to overcome this limitation, Augenstein et al. [81] proposed a differentially private algorithm that synthesizes examples representative of the private data by adding Gaussian noise to the weighted average update.
Gradient perturbation can also be used to protect the privacy of training data in GAN training with federated learning. Chen et al. [62] extended the gradient-sanitized WGAN to train GANs with differential privacy in the federated setting and noted some subtle differences between their method and the method of [81]. Different hospitals jointly training a model through data sharing to diagnose COVID-19 pneumonia can also lead to privacy disclosure. In order to solve this problem, Zhang et al. [82] proposed a federated differentially private GAN for detecting COVID-19 pneumonia, which can effectively diagnose COVID-19 without compromising privacy under IID and non-IID settings. The distributed storage of data and the fact that data cannot be shared for privacy reasons in the federated learning environment bring new challenges to training GANs. Thus, Nguyen et al. [83] proposed a new federated learning scheme to generate realistic COVID-19 images for facilitating enhanced COVID-19 detection with GANs in edge cloud computing; this scheme integrates a differential privacy solution at each hospital institution to enhance privacy in federated COVID-19 data analytics. By adding Gaussian noise to the gradient update process of the discriminator, Xin et al. [84] proposed a differentially private GAN based on federated learning, strategically combining the Lipschitz condition and differential privacy sensitivity, which uses a serialized model-training paradigm to significantly reduce the communication cost. Considering that distributed data are often non-IID in reality, which brings challenges to modeling, Xin et al. further proposed a universal private FL-GAN to solve this problem. These algorithms can provide strict privacy guarantees using differential privacy while still generating satisfactory data and protecting the privacy of the training data, even if the data are non-IID.
Furthermore, considering differential average-case privacy [18] to enhance the privacy protection of federated learning, Triastcyn and Faltings [85] proposed a privacy-preserving data publishing framework using GANs in the federated learning environment, in which the generator component is trained by the FedAvg algorithm to draw private artificial data samples and to empirically evaluate the risk of information disclosure. It can generate high-quality labeled data to successfully train and validate supervised models, significantly reducing the vulnerability of such models to model inversion attacks.
9. Open Problems
Our survey shows that current work focuses on the definitions, privacy-utility metrics, properties, and achieving mechanisms of GDP and LDP based on the information-theoretic channel model. Mir [14] obtained the exponential mechanism achieving GDP by minimizing mutual information under the expected distortion constraint. In terms of Equation (5) of the LDP definition, we can intuitively obtain the binary randomized response mechanism, the quaternary randomized response mechanism, and the multivariate randomized response mechanism under the binary symmetric channel, the quasi-symmetric channel, and the strongly symmetric channel, respectively. Wang et al. [38] obtained the k-subset mechanism by maximizing mutual information under the LDP constraint. Although GDP and LDP have been studied based on the information-theoretic channel model, there remain open problems for different application scenarios and data types from the perspective of different types of information-theoretic channels, summarized in Table 17.
(1) New LDP from the perspective of the information-theoretic channel. Because local users have different privacy preferences, Yang et al. [86] proposed personalized LDP. However, it is necessary to study personalized LDP from the perspective of the information-theoretic channel and propose corresponding achieving mechanisms. Although LDP does not require a trusted third party, it regards all local data as equally sensitive, which causes excessive protection and results in a utility disaster [87]. Thus, it is necessary to study the utility-optimized mechanism for the setting where all users use the same random perturbation mechanism. In addition, since the distinction between sensitive and non-sensitive data varies from user to user, a personalized utility-optimized mechanism for individual data is needed that achieves high utility while preserving the privacy of sensitive data. Holohan et al. [42] proposed an optimal randomized response mechanism satisfying (ε, δ)-LDP; the optimal randomized response mechanism needs to be analyzed and derived from the perspective of the information-theoretic channel. Moreover, a new LDP mechanism needs to be analyzed by using the average error probability [20] as the utility metric under the rate-distortion framework of LDP.
(2) LDP from the perspective of discrete sequence information-theoretic channel. Collecting multiuser high-dimensional data can produce rich knowledge. However, this brings unprecedented privacy concerns to the participants [
88,
89]. In view of the privacy leakage risk of high-dimensional data aggregation, using existing LDP mechanisms yields poor data utility. Thus, it is necessary to study optimal LDP mechanisms for aggregating high-dimensional data from the perspective of the discrete sequence information-theoretic channel. Furthermore, correlations exist between the various attributes of high-dimensional data; if these correlations are not modeled, applying LDP to high-dimensional correlated data also leads to poor data utility [90,91]. By constructing a discrete sequence information-theoretic channel model of high-dimensional correlated data aggregation with LDP under a joint probability or Markov chain model, an LDP mechanism suitable for high-dimensional correlated data aggregation needs to be provided.
(3) GDP from the perspective of the continuous information-theoretic channel. For GDP, no work has shown the direct relationship between GDP mechanisms, such as the Laplace mechanism, the discrete Laplace mechanism, and the Gaussian mechanism, and the single-symbol continuous information-theoretic channel model. RDP is a general privacy definition, but existing work has not provided RDP mechanisms under the continuous information-theoretic channel model; thus, RDP mechanisms need to be studied from this perspective. The continuous release of correlated data and their statistics has the potential for significant social benefits. However, privacy concerns hinder the wider use of these continuous correlated data [
92,
93]. Therefore, the corresponding GDP mechanism from the perspective of continuous multi-symbol information-theoretic channel needs to be studied by combining the joint probability or Markov chain for continuous correlated data releasing with DP. However, it is common that the data curators have different privacy preferences with their data. Thus, personalized DP [
94] needs to be studied based on continuous information-theoretic channel model. Existing GDP mechanisms ignore the characteristics of data and directly perturb the data or query results, which will inevitably lead to poor data utility. Therefore, it is necessary to study adaptive GDP depending on characteristics of data [
95] from the perspective of continuous information-theoretic channel. Since users have different privacy demands, aggregate data analysis with DP also has poor data utility. Thus, adaptive personalized DP [
96] also needs to be studied based on the type of query function, data distribution, and privacy settings from the perspective of continuous information-theoretic channel.
(4) GDP and LDP from the perspective of the multiuser information-theoretic channel. Large amounts of individual data are aggregated for computing various statistics, query responses, classifiers, and other functions. However, these processes can release sensitive information and compromise individual privacy [97,98,99,100]. Thus, when considering the aggregation of multiuser data, GDP and LDP mechanisms need to be studied from the perspective of the multiple access channel. Data collection with GDP and LDP has mostly been studied for homogeneous and independently distributed data. In real-world applications, data have an inherent correlation which, if not harnessed, leads to poor data utility [101,102]. Thus, when the multiuser data are correlated, GDP and LDP mechanisms need to be studied from the perspective of the multiuser channel with correlated sources. With the acceleration of digitization, more and more high-dimensional data are collected and used for different purposes. When these distributed data are aggregated, they can become valuable resources to support better decision making or provide high-quality services. However, because the data held by each party may contain highly sensitive information, simply integrating local data and sharing the aggregated results poses a serious threat to individual privacy [103,104]. Therefore, GDP and LDP mechanisms need to be studied from the perspective of the broadcast channel for data releasing and sharing of multi-party data.
(5) Adaptive differential privacy with GAN. Existing work can generate differentially private synthetic data using GANs. However, because of the differential privacy noise introduced during training, the convergence of the GAN becomes even more difficult, leading to poor utility of the output generator at the end of training. Therefore, it is necessary to explore adaptive differential privacy for GAN-based synthetic data so as to generate high-quality synthetic data according to the real data distribution. Combining the differential privacy definition and information-theoretic metrics, a new differentially private loss function model for GANs needs to be studied, one that converges and reaches the optimal solution. Based on this differentially private loss function model, an adaptive differential privacy model needs to be constructed, and GANs and their variants can then generate synthetic data under this adaptive model. To improve the quality of the synthetic data under the adaptive differential privacy model, GAN modeling can be enhanced with more layers, more complex structures, or transfer learning. Moreover, the speed of GAN training can be accelerated by reducing the privacy budget. To resolve mode collapse and non-convergence issues, it is necessary to fine-tune hyperparameters such as the learning rate and the number of discriminator epochs. Furthermore, the proposed adaptive differential privacy model with GANs should be extended to a distributed setting by using federated learning and should explore data augmentation methods that can mitigate the non-IID problem.