A Face Privacy-Protection Model (DIFP) based on a diffusion model is proposed. As shown in Figure 1, during training, a high-quality ID image is drawn from the training dataset and an identity code $c_{id}$ is derived from it. Additionally, an image of any style is selected to obtain the style code $c_s$. These codes are then combined through a dual-conditional face data generator guided by identity and style. Realism control is applied to the original image through iterative latent-variable refinement to achieve more realistic privacy protection. Finally, the initial latent encoding $z_0$ is obtained by denoising the noisy latent encoding $z_T$, and it is decoded to perform identity recovery.
Section 3.1 introduces the problem formulation and the threat model. Section 3.2 briefly reviews the fundamentals of diffusion models. Section 3.3 then explores our face privacy-protection framework in depth.
3.1. Preliminaries
3.1.1. Problem Formulation
Let $X = \{x_1, x_2, \dots, x_N\}$ be an original dataset of facial images, where $x_i$ denotes the facial image of the $i$th individual and $N$ represents the total number of individuals in the dataset. The objective is to devise a privacy-preservation mechanism $\mathcal{M}$, underpinned by diffusion models, that transforms each original facial image $x_i$ into a privacy-preserving synthetic image $\hat{x}_i = \mathcal{M}(x_i)$ while maintaining a degree of utility for recognition purposes.
Diffusion models are conventionally defined as a sequence of noise-injection steps progressing from a clean data distribution $q(x_0)$ to a noise distribution $q(x_T)$, with $q(x_T)$ approximating Gaussian noise. In privacy-protection applications, we instead use the inverse process, moving from noise back towards the clean data distribution, and introduce privacy-protection mechanisms into this process according to the characteristics of the threat model. For the strongest protection we adopt anonymization, so that the generated $\hat{x}$ looks like a plausible face but no longer corresponds to any particular original individual. At the same time, we consider data utility and retain key information so that the face can be recovered when necessary.
Privacy preservation aims to minimize the information leakage $I(x; \hat{x})$, the mutual information between the original image $x$ and the generated image $\hat{x}$. Meanwhile, to ensure a certain level of recognition performance, the utility of the generated images for legitimate recognition tasks, quantifiable as the accuracy of the conditional probability $p(y \mid \hat{x})$, where $y$ is the associated label (such as identity), needs to be maximized.
Thus, the core optimization goal encapsulates a trade-off problem:

$$\max_{\mathcal{M}} \; U(\hat{x}) - \lambda \, I(x; \hat{x}) \quad (1)$$

Here, $\lambda$ acts as a balancing parameter, modulating the compromise between privacy protection and utility; $U(\cdot)$ signifies a function gauging the utility of recognition, such as a recognition-accuracy metric; and $I(x; \hat{x})$ denotes the magnitude of privacy leakage, computable through mutual information theory.
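To make the trade-off concrete, the following is a minimal Python sketch of the objective in Equation (1). It assumes the utility and leakage terms have already been estimated as scalars; the function name and the example values are illustrative, not part of the paper's implementation.

```python
# A minimal sketch of the privacy-utility trade-off of Equation (1), assuming
# we already have scalar estimates of recognition utility U(.) and of the
# mutual-information leakage I(.;.). Names here are illustrative placeholders.

def privacy_utility_objective(utility: float, leakage: float, lam: float = 0.5) -> float:
    """Return the objective U(x_hat) - lambda * I(x; x_hat), to be maximized.

    utility: recognition accuracy of the protected images, in [0, 1].
    leakage: estimated mutual information between original and protected images.
    lam:     balancing parameter between privacy protection and utility.
    """
    return utility - lam * leakage

# Example: high utility with noticeable leakage vs. lower utility with little leakage.
print(privacy_utility_objective(utility=0.92, leakage=0.40))  # 0.72
print(privacy_utility_objective(utility=0.85, leakage=0.05))  # 0.825
```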
3.1.2. Threat Model
The threat model describes the challenges faced by the privacy-protection technology we design, including the types of attacks that attempt to break its integrity and confidentiality. We construct the threat model from several aspects: the adversary's objectives, the adversary's knowledge level, the objectives of privacy protection and the practicality of the model.
Adversary objectives: the adversary seeks to recover the sensitive facial features of the initial image $x_0$, which are usually associated with personal identity privacy.
Data access and relationship to the diffusion model: the adversary has access to the published output $\hat{x}$; that is, the adversary can observe some form of the model's output, which is designed to protect privacy.
Adversary knowledge level: the adversary may know the working principle of the diffusion model, including how the data at one moment relate to the noisy data at the next moment after noise is added.
Privacy-protection objectives: under this threat model, the goal is to design a model that remains practical with respect to data utility while minimizing the leakage of identifying features, striking a balance between protecting personal privacy and maintaining image utility.
Practicality of the model: the threat model is designed for generative settings. It provides a framework within which, given known adversary capabilities and behaviors, an effective privacy policy can be developed. By defining this threat model, we can anticipate the adversary's potential attack means and strategies and ensure that they are matched by the proposed privacy-protection method.
3.2. Overview of Generative Diffusion Models
The diffusion model, a probabilistic generative model, learns the data distribution by reverse-learning a Markov chain of fixed length $T$, gradually denoising a normally distributed variable. Target data are generated by incrementally injecting random noise into the input data and then removing it. The process thus consists of two stages: forward diffusion and backward reconstruction. First, a pre-trained encoder $G$ extracts a latent encoding $z_0$ from the initial image $x_0$. Denoising operations are then performed in the latent space, after which the feature vector is decoded into the desired image by the pre-trained decoder $D$. This process encompasses both the encryption and the recovery of images in our face privacy-protection model.
During the forward diffusion phase, random noise is incrementally added to the initial image so that the data progressively approach the target noise distribution. This process, a fixed-length Markov chain, generates a sample $x_t$ by introducing noise $\epsilon_t$ at each time step $t$, given the initial data $x_0$. The diffusion process can be mathematically represented by Equation (2):

$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t \quad (2)$$
where $x_t$ denotes the current data at time $t$; $x_{t-1}$ denotes the previous data at time $t-1$, so that the data distribution at time $t$ is determined by the data at time $t-1$ and the added noise; $\epsilon_t$ denotes the introduced random noise; and $\beta_t$ is the diffusion coefficient at time step $t$, which determines the magnitude of the introduced noise and is usually not fixed but chosen according to the needs of the task. As the time step increases, we introduce smaller noise in the early stages of diffusion and larger noise in the later stages. The value of $\beta_t$ can be either a learnable parameter of the model or a manually set hyperparameter, such that the data at time $t$, generated conditionally on the data at time $t-1$, follow a normal distribution, as shown in Equation (3):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \quad t = 1, \dots, T \quad (3)$$
where $q(x_t \mid x_{t-1})$ denotes the conditional probability distribution of the data $x_t$ at the current time step $t$, given the data $x_{t-1}$ at the previous time step $t-1$; the normal distribution $\mathcal{N}$ is a continuous probability distribution; $x_t$ denotes the data at time step $t$; $\beta_t$ is a coefficient that regulates the amount of noise added at each step of the diffusion process, thereby adjusting its impact; $x_{t-1}$ denotes the noisy data from the previous time step $t-1$; $\beta_t \mathbf{I}$ controls the variance of the Gaussian noise added at each step, with $\mathbf{I}$ the identity matrix; and the time step $t$ runs from 1 to $T$, where $T$ is the total number of steps of the diffusion process.
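As an illustration of Equations (2) and (3), the following PyTorch sketch implements the forward noising process, both step by step and in the equivalent closed form via the cumulative product $\bar{\alpha}_t$ of Equation (6). The linear $\beta$ schedule and the tensor shapes are assumptions for demonstration, not the paper's configuration.

```python
import torch

# A minimal sketch of forward diffusion, assuming a linear beta schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t grows with t: small noise early, larger later
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # cumulative product alpha_bar_t (Equation (6))

def diffuse_one_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One step of Equation (2): x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * eps

def diffuse_closed_form(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Equivalent closed form: sample x_t directly from x_0 using alpha_bar_t."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps

x0 = torch.randn(1, 3, 64, 64)                # a stand-in "image" tensor
x_t = diffuse_closed_form(x0, t=500)
```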
The backward reconstruction stage starts from the final outcome of diffusion and employs gradual denoising to restore the data to its target form. A decoder $D$ predicts the mean and covariance of $x_{t-1}$ for a given input $x_t$. The model is trained using the standard mean squared error (MSE) loss to predict the added noise $\epsilon$, using the parameterized Gaussian transition of Equation (4):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \quad (4)$$
where $p_\theta(x_{t-1} \mid x_t)$ denotes the conditional probability distribution, in the reverse denoising process, of the data $x_{t-1}$ at the previous time step $t-1$ given the data $x_t$ at the current time step $t$; $\theta$ denotes the model parameters; $\mathcal{N}$ represents the normal distribution, a continuous probability distribution; $x_{t-1}$ denotes the data at time step $t-1$ and $x_t$ the data at time step $t$; $\mu_\theta(x_t, t)$ denotes the mean of the conditional distribution, a function of $x_t$ and the time step $t$ computed by a network parameterized by $\theta$; and $\Sigma_\theta(x_t, t)$ denotes the covariance matrix of the conditional distribution, likewise a function of $x_t$ and $t$ computed by the network.
The mean $\mu_\theta(x_t, t)$ can be obtained either through Bayes' theorem or by predicting the noise $\epsilon_\theta(x_t, t)$ using Equation (5). The weight $\bar{\alpha}_t$ in Equation (6) represents the cumulative impact of gradually adding noise. The covariance $\Sigma_\theta(x_t, t)$ incorporates an additional loss term $L_{\mathrm{vlb}}$ (Equation (7)) to maximize the likelihood of the model while minimizing the KL divergence between the distribution of the latent variables and the prior distribution. This interpolates between the upper and lower bounds of the original fixed covariance, facilitating a smoother approximation of the true distribution during learning.

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \quad (5)$$
where $\mu_\theta(x_t, t)$ represents the predicted data mean of the previous time step in the backward process, given the current noisy data $x_t$ at time step $t$; $\mu$ denotes the mean and $\theta$ the model parameters; $x_t$ denotes the data at time step $t$; $\alpha_t$ is a predetermined scheduling parameter, typically between 0 and 1, that controls the amount of noise added from $x_{t-1}$ to $x_t$; $\beta_t = 1 - \alpha_t$ is the coefficient associated with $\alpha_t$ that controls the addition of noise; $\bar{\alpha}_t$ denotes the cumulative product of $\alpha_s$ from time step 1 to $t$; and $\epsilon_\theta(x_t, t)$, determined by the model parameters, is a random variable representing the predicted value of the noise term at time step $t$, i.e., the noise added to the data at time step $t$.
$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \alpha_t = 1 - \beta_t \quad (6)$$

Equation (6) shows how the $\bar{\alpha}_t$ in Equation (5) is calculated, ensuring the smoothness of the forward diffusion process and the accuracy of the backward denoising process, as well as flexibility in designing the diffusion process. By considering both $\alpha_t$ and $\bar{\alpha}_t$, the noise growth at each step can be controlled more finely: $\bar{\alpha}_t$ captures the cumulative effect of noise from the start to the current step, while $\alpha_t$ focuses on the noise added at the current step, providing precise prediction and noise removal for the reverse denoising process.

$$L_{\mathrm{vlb}} = \mathbb{E}_q\!\left[ D_{\mathrm{KL}}\!\left( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \right) \right] \quad (7)$$
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ represents the KL divergence, used to measure the similarity between the probability distribution of the real data and that of the model-generated data; $\mathbb{E}_q$ stands for the expectation operation with respect to the distribution $q$; $q(x_{t-1} \mid x_t, x_0)$ represents the distribution of the latent variables; and $p_\theta(x_{t-1} \mid x_t)$ denotes the generated data distribution. The guided removal of noise in the backward process facilitates the gradual restoration of the data to its initial state.
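One reverse step built from Equations (4)-(6) can be sketched as follows, assuming `eps_model` is a trained noise-prediction network $\epsilon_\theta(x_t, t)$. Using the fixed variance $\sigma_t^2 = \beta_t$ is a common simplification and stands in for the learned covariance described above.

```python
import torch

# A minimal sketch of one reverse denoising step, assuming eps_model(x_t, t)
# returns the predicted noise; sigma_t^2 = beta_t is an assumed fixed variance.

def reverse_step(eps_model, x_t: torch.Tensor, t: int,
                 betas: torch.Tensor, alphas: torch.Tensor,
                 alpha_bars: torch.Tensor) -> torch.Tensor:
    """Compute x_{t-1} from x_t: mean via Equation (5), plus Gaussian noise for t > 0."""
    eps_pred = eps_model(x_t, t)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps_pred) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                                   # final step returns the mean directly
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```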
Starting from a simple input face image, our face privacy-protection model gradually generates a complex privacy-protected image that resembles the target image. This is accomplished through noise guidance and control, along with the gradual reduction of noise, ultimately approximating the original face image closely.
3.3. Framework for Face Privacy Protection
According to the adversary capabilities defined in the threat model, we use differential privacy to formulate a guidance strategy that adds identity and style conditions to the diffusion model, yielding rich and diverse anonymized face images. Although the adversary knows the working principle of the basic diffusion model, only the perturbed image can be obtained after denoising. We assume the adversary has no access to the model itself, which makes our method both confidential and practical: the face image can be recovered and used for the corresponding task when necessary.
3.3.1. Diffusion Models Guided by Identity and Style Conditions
In our approach, we utilize the U-Net structure, a fundamental component of the diffusion model, for noise prediction. It takes as inputs the current time step $t$, the current noisy data $x_t$ and the associated identity condition $c_{id}$ and style condition $c_s$. The mean $\mu_\theta$ and covariance $\Sigma_\theta$ of the state transition from one time step to the next are modified to incorporate the conditions, enabling conditional generation control. As shown in Figure 2, this enhances both the flexibility and the controllability of the generated results. By adjusting the level of conditional guidance, the resulting images are tailored towards specific identity and style attributes rather than relying solely on random noise. To model the reverse process, we use identity and style as conditional inputs so that the model takes the details of identity and style into account when generating images, thereby producing more accurate and consistent results. The conditionally generated Gaussian transitions are represented by Equation (8):

$$p_\theta(x_{t-1} \mid x_t, c_{id}, c_s) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c_{id}, c_s),\ \Sigma_\theta(x_t, t, c_{id}, c_s)\right) \quad (8)$$
where $p_\theta(x_{t-1} \mid x_t, c_{id}, c_s)$ represents the probability density function that generates the previous time-step state $x_{t-1}$ based on the conditions $c_{id}$ and $c_s$. The parameters $\mu_\theta(x_t, t, c_{id}, c_s)$ and $\Sigma_\theta(x_t, t, c_{id}, c_s)$ correspond to the adjusted mean and covariance, respectively, which are influenced by the identity and style conditions.
In practice, similar to previous research, the conditional denoising process is trained to predict the noise while incorporating the additional identity and style information, represented as $\epsilon_\theta(x_t, t, c_{id}, c_s)$.
To regulate the influence of the identity and style conditions, we adopt a classifier-free guidance method that allows the generative model to be guided without relying on specific classifier labels, so the model can be conditioned during generation without category information. Dual-conditional guidance is adjusted by introducing conditional vectors into the generation process. Building on the conditional transition of Equation (8), we employ a two-stage training strategy that separates training into distinct phases with different objectives and configurations. The initial stage trains the model with complete identity and style conditions, so that it learns the condition correlations from the beginning and has a robust starting point. During this phase, the loss function covers generation under complete conditions, as expressed by Equation (9); it is used to optimize the model parameters so as to better predict the noisy data, measuring the difference between the noise $\epsilon_\theta(x_t, t, c_{id}, c_s)$ predicted by the model and the actual noise $\epsilon$. By minimizing this loss, the model learns to recover data from noisy data while taking the guiding identity and style information into account, which helps generate images consistent with the identity and style of the input.

$$L_1 = \mathbb{E}_{t, x_0, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c_{id}, c_s) \right\|^2 \right] \quad (9)$$
In the loss function, $\epsilon$ represents the actual noise term, while the noise term $\epsilon_\theta(x_t, t, c_{id}, c_s)$ predicted by the model is influenced by the identity and style conditions. The norm $\|\cdot\|^2$ signifies the squared Euclidean distance between the two vectors, and $\mathbb{E}_{t, x_0, \epsilon}$ denotes the expectation over all time steps $t$, initial images $x_0$ and noise terms $\epsilon$. This formulation enables the model to generate outputs closely resembling the actual noise terms during the initial stage of training, facilitating correct learning of the correlations with the identity and style conditions.
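A minimal sketch of the stage-one objective of Equation (9) follows; the signature `model(x_t, t, c_id, c_style)` is an assumed interface for the conditional U-Net, not the paper's actual API.

```python
import torch

# A minimal sketch of the stage-one training loss of Equation (9), assuming
# model(x_t, t, c_id, c_style) predicts the injected noise.

def stage1_loss(model, x0: torch.Tensor, c_id: torch.Tensor, c_style: torch.Tensor,
                alpha_bars: torch.Tensor) -> torch.Tensor:
    """MSE between the true noise eps and eps_theta(x_t, t, c_id, c_style)."""
    t = torch.randint(0, alpha_bars.numel(), (x0.shape[0],))    # random time steps
    eps = torch.randn_like(x0)                                  # actual noise term
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps      # noised input
    eps_pred = model(x_t, t, c_id, c_style)                     # conditional prediction
    return torch.mean((eps - eps_pred) ** 2)                    # squared Euclidean distance
```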
The model's performance is enhanced in the second stage through iterative refinement and optimization of its parameters. During this stage, the model undergoes further training with partial conditions: for each condition, 40% of the associated original pixels are randomly replaced to augment the training data. This yields an unconditional representation and enhances the model's ability to generate outputs across a wider range of conditions, improving its generalization. By employing two distinct training phases, we can effectively guide the model's learning process, allowing it to focus on specific objectives and tasks at each stage. In the sampling phase, Equation (10) is used to regulate the balance between identity and style fidelity, facilitating flexible guidance of the generation process without requiring specific classifier labels. This methodology enhances the adaptability and utility of the generative model.
During generation, the model computes a base unconditional noise term $\epsilon_\theta(x_t, t, \oslash, \oslash)$ and then adds the influence of the identity and style conditions back into the generation process, weighted by $s_{id}$ and $s_s$, respectively. By adjusting the values of $s_{id}$ and $s_s$, we can control the influence of the input identity and style conditions on the generated image.

$$\tilde{\epsilon}_\theta(x_t, t, c_{id}, c_s) = \epsilon_\theta(x_t, t, \oslash, \oslash) + s_{id}\left[\epsilon_\theta(x_t, t, c_{id}, \oslash) - \epsilon_\theta(x_t, t, \oslash, \oslash)\right] + s_s\left[\epsilon_\theta(x_t, t, \oslash, c_s) - \epsilon_\theta(x_t, t, \oslash, \oslash)\right] \quad (10)$$
where $\tilde{\epsilon}_\theta(x_t, t, c_{id}, c_s)$ is the final noise term, which depends on the current time step $t$ and the current noisy data $x_t$, as well as the identity condition $c_{id}$ and style condition $c_s$; $\theta$ denotes the model parameters; $\oslash$ marks a dropped condition and is used to generate noise terms that do not depend on identity or style; $s_{id}$ and $s_s$ are control parameters that balance the influence of identity and style on the generation process; $\epsilon_\theta(x_t, t, c_{id}, \oslash)$ and $\epsilon_\theta(x_t, t, \oslash, c_s)$ denote the noise terms under the identity-only and style-only conditions, respectively; and $\epsilon_\theta(x_t, t, \oslash, \oslash)$ denotes the noise term without the influence of either condition.
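The dual-condition guidance of Equation (10) can be sketched as follows, assuming the network accepts `None` in place of a dropped condition, standing in for the null token $\oslash$.

```python
import torch

# A minimal sketch of the dual-condition classifier-free guidance of Equation (10).
# Passing None for a condition stands in for the null token; scale names follow the text.

def guided_eps(model, x_t: torch.Tensor, t: int,
               c_id: torch.Tensor, c_style: torch.Tensor,
               s_id: float, s_style: float) -> torch.Tensor:
    eps_uncond = model(x_t, t, None, None)         # no identity, no style
    eps_id = model(x_t, t, c_id, None)             # identity-only condition
    eps_style = model(x_t, t, None, c_style)       # style-only condition
    return (eps_uncond
            + s_id * (eps_id - eps_uncond)         # pull towards the identity condition
            + s_style * (eps_style - eps_uncond))  # pull towards the style condition
```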
During the second stage, the model utilizes partial conditions, and the loss function is modified to incorporate the unconditionally generated portion, as displayed in Equation (11):

$$L_2 = \mathbb{E}_{t, x_0, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, \oslash, \oslash) \right\|^2 \right] \quad (11)$$
where $\mathbb{E}$ represents the expectation operation; $t$ is the time step of the diffusion process; $x_0$ represents the input data without noise; $\epsilon$ denotes the actual noise term added to the data $x_0$ at time step $t$, while $\epsilon_\theta(x_t, t, \oslash, \oslash)$ is the noise term predicted by the model without considering the identity and style conditions; $\|\cdot\|^2$ signifies the squared $\ell_2$ norm, indicating the Euclidean distance between the two vectors; and $\mathbb{E}_{t, x_0, \epsilon}$ denotes the expectation over all time steps, initial images and noise terms. The objective of the second training phase is to refine the model's outputs to align closely with the actual noise term while disregarding the identity and style conditions. This enhances the model's capacity for generalization across a broader range of scenarios and improves its ability to generate diverse privacy-protected images in a more controlled manner.
Our training strategy incorporates a condition-controlled approach to guide the diffusion model, allowing for different levels of guidance, including full, partial and no guidance conditions. This enhances the model’s ability to generate a diverse range of facial images while ensuring privacy protection. By adjusting the level of conditional guidance, we can personalize the generated images in terms of identity and style, thereby increasing their diversity and expressive potential.
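A sketch of how the partial-condition (stage-two) training might be wired is given below. The paper replaces 40% of each condition's pixels; as a simplifying assumption we drop the whole condition to the null token with probability 0.4, which is the common way classifier-free guidance training is implemented.

```python
import torch

# A minimal sketch of stage-two condition dropout, assuming a dropped condition is
# represented by None (the null token). Dropping the whole condition with p = 0.4 is
# our simplification of the paper's partial pixel replacement.

def maybe_drop(cond: torch.Tensor, p_drop: float = 0.4):
    """Randomly drop a condition so the model also learns an unconditional prediction."""
    return None if torch.rand(()) < p_drop else cond

def stage2_conditions(c_id: torch.Tensor, c_style: torch.Tensor):
    """Return the (possibly dropped) identity and style conditions for one batch."""
    return maybe_drop(c_id), maybe_drop(c_style)
```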
3.3.2. Realistic Control
To address concerns regarding the authenticity and trustworthiness of the generated privacy-protected facial images, we propose a mechanism that ensures reliability and realism in the output. Our objective is to modify only the visual attributes of the face while maintaining consistency with other contextual factors, thereby guaranteeing the authenticity of private images. We present a three-fold control approach that integrates identity and style guidance alongside realism control. This technique aims to generate private images that closely resemble their original counterparts while adhering to conditional constraints. Specifically, we introduce realism control by iteratively refining the latent features and applying low-pass-filtered downsampling of the initial image alongside the two-dimensional classifier-free guidance. These additional adjustments enhance the generation process, resulting in greater realism and authenticity in the produced images.
During the inference stage of the generative model, it is crucial to enhance the quality of generated privacy images and ensure they meet specific conditions by refining latent features. To achieve this, we propose an iterative method for refining latent features in the generated image. This method involves multiple rounds of adjustments to iteratively improve the image quality while maintaining consistency with the provided identity and style information. We accomplish this by progressively refining the latent variables using a downsampled original image at each intermediate transformation stage of the generation process. The objective is to align the generated privacy-preserving face images more accurately with user-provided input or adhere closely to the data distribution of the original image.
To enhance the authenticity of the generated images while preserving identity and style details, we employ a linear low-pass filtering operation $\mathrm{LP}$ to reduce the image resolution and attenuate high-frequency components. The filtered image is downscaled to a size denoted by $H$ and then upscaled back to its original resolution. The value of $H$ is determined by the realism scale parameter, and a reference image containing both identity and style information is used to fine-tune the consistency of the resulting image with the reference. As shown in Equation (12), the realism of the image is controlled by the iterative latent-variable refinement method in the generation process, which combines the predicted data $x'_{t-1}$ and the reference image $y_{t-1}$ processed by the low-pass filter to generate the final $x_{t-1}$:

$$x_{t-1} = x'_{t-1} - \mathrm{LP}_H(x'_{t-1}) + \mathrm{LP}_H(y_{t-1}) \quad (12)$$
where $x_{t-1}$ represents the final generated data at time step $t-1$, computed by combining the prediction $x'_{t-1}$ with the reference image $y_{t-1}$; $\mathrm{LP}_H$ denotes a linear low-pass filter operation that downsamples the input image to a transform size $H$ and then upsamples it back to the original size, adjusting the high- and low-frequency content of the image and thus controlling its detail and realism; and $y_{t-1}$ denotes the reference image at time step $t-1$, obtained by gradually injecting noise into the combined identity and style input image $y$. This reference image carries the identity and style information used to guide the generation process.
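The update of Equation (12) can be sketched as follows; implementing $\mathrm{LP}_H$ as bilinear downsampling followed by upsampling is an assumption about the filter's concrete form.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the realism-control update of Equation (12); the bilinear
# interpolation mode for the low-pass filter LP is an assumed implementation detail.

def low_pass(img: torch.Tensor, H: int) -> torch.Tensor:
    """LP_H: downsample the image to H x H, then upsample back to the original size."""
    small = F.interpolate(img, size=(H, H), mode="bilinear", align_corners=False)
    return F.interpolate(small, size=img.shape[-2:], mode="bilinear", align_corners=False)

def realism_refine(x_pred: torch.Tensor, y_ref: torch.Tensor, H: int) -> torch.Tensor:
    """Equation (12): keep the high frequencies of the prediction, take the low
    frequencies from the noised reference carrying identity and style information."""
    return x_pred - low_pass(x_pred, H) + low_pass(y_ref, H)
```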
The realism scale factor adjusts the similarity between the resulting image and the original image, ensuring that the generated image remains authentic while closely matching the appearance and style of the real image data distribution. This is achieved by controlling specific image parameters, as indicated in Equation (13): $\varphi$ and $M$ determine the value of $H$, which controls the scale of the downsampling and upsampling operations of the low-pass filter $\mathrm{LP}$ within the iterative latent-variable refinement process. By manipulating the realism scale factor, the generation quality of privacy-protected face images can be fine-tuned, dynamically balancing condition control and image realism during the generation process to yield more realistic privacy images.
where $H$ represents the parameter determining the size of the image transformation (the downsampled size), while $\varphi$ denotes the realism scale, controlling the balance between consistency and realism during image generation. As shown in Equation (14), as $\varphi$ approaches 0, $H$ increases, preserving more of the high-frequency information and detail of the original image, albeit with potential distortion. Conversely, as $\varphi$ approaches 1, $H$ decreases, resulting in a smoother image more aligned with the target data distribution. Here, $M$ denotes the size of the reference image $y$, and $k$ represents a constant term.
$$\varphi = 1 - \mathrm{SSIM}(\hat{x}, y) \quad (14)$$

where $\hat{x}$ is the generated image, $y$ is the reference image and $\mathrm{SSIM}$ is the structural similarity index. $\mathrm{SSIM}(\hat{x}, y)$ measures the similarity between the generated image and the reference image: the higher the similarity between them, the closer the value of $\varphi$ is to 0; if the similarity is low, the value of $\varphi$ is close to 1.
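A sketch of Equation (14) using scikit-image's SSIM implementation; grayscale float images in $[0, 1]$ and the function name `realism_scale` are assumptions for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity

# A minimal sketch of Equation (14), assuming phi = 1 - SSIM(generated, reference)
# over grayscale float images in [0, 1].

def realism_scale(generated: np.ndarray, reference: np.ndarray) -> float:
    ssim = structural_similarity(generated, reference, data_range=1.0)
    return 1.0 - ssim   # close to 0 when the two images are highly similar

rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(realism_scale(img, img))  # 0.0: identical images give the minimum realism scale
```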
3.3.3. Face Identity Recovery
In this section, the encoder $G$ is used to extract key features from the original image and map it into the latent space, preserving the essential original information in a latent representation $z_0$ used for face image restoration. During the privacy-protection process described in Section 3.2, noise is added to the initial latent code $z_0$ to obtain the latent code $z_T$. Subsequently, using inversion [34], a map containing noise serves as the key information. Equation (15) calculates the time step for noise addition, determining when noise should be introduced throughout the privacy-protection process. Adding noise at this stage helps preserve the global structure and characteristics of the image while obscuring the facial appearance, making facial features less visually discernible and effectively concealing individual identity:

$$t_0 = \lambda \cdot T \quad (15)$$
where $T$ represents the total number of steps involved in noise removal, while $\lambda$ denotes the scaling factor responsible for regulating the intensity of the noise. This scaling factor allows the noise intensity to be adjusted, thereby influencing the extent of noise introduction and removal; the choice of time step directly impacts the intensity of noise manipulation.
During the face restoration process, iterative updates of the noisy image $z_t$ produce the denoised face-restoration image. Equation (16), the key step in generating the restored face image, uses the conditional embedding of the original image's key information to restore the noisy image. Specifically, an updated latent representation $z_{t-1}$, corresponding to the denoised recovered image at time step $t-1$, is generated by combining the noisy image $z_t$ and the noise term $\epsilon_\theta$ with the key information of the original image. We repeat this process until the final noise-free image is generated. In this way, the diffusion model gradually recovers a high-quality face image from the noise while preserving the details of the original data distribution.

$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, K_1, K_2) \right) \quad (16)$$
where $\epsilon_\theta(z_t, t, K_1, K_2)$ represents the noise term guided by the key information $K_1$ and $K_2$, as well as the model parameters $\theta$; this noise term adds details in the inverse process so as to recover the original face image with high quality. $K_1$ and $K_2$ represent the conditional guidance used at time step $t$, the key feature information extracted from the pre-trained diffusion model that guides the image-restoration process. $\theta$ denotes the parameters of the U-Net model, which are optimized during training to improve the quality of the recovered face images.
Equation (17) is the iterative step used for image generation in the face-restoration process; it guides the denoising at each step to obtain the clean estimate. By employing the latent encoding and guidance from the original image, an initial "denoised" encoding is generated to facilitate the reconstruction of an image that closely resembles the original. This ensures the production of a high-quality restored image, as shown in Figure 3.

$$\hat{z}_0 \leftarrow \hat{z}_0 - \eta\, \nabla_{\hat{z}_0} \mathcal{L} \quad (17)$$
where $\hat{z}_0$ represents the estimated "clean" latent encoding derived from $z_t$; $\eta$ denotes the learning rate; and $\nabla_{\hat{z}_0} \mathcal{L}$ represents the loss gradient with respect to the target. This loss metric evaluates the similarity between the resulting image and the original.
The initial denoised latent code $\hat{z}_0$ is obtained and then decoded to reconstruct the image representation. This involves mapping the latent code back to the image space through an inverse transformation, resulting in the recovered image $\hat{x}$. The decoder takes $\hat{z}_0$ as input and utilizes its learned weights and bias parameters to generate a representation that increasingly resembles the original image. The recovered image $\hat{x}$ is generated using Equation (18):

$$\hat{x} = \mathrm{Dec}(\hat{z}_0) \quad (18)$$
where $\hat{x}$ represents the recovered face image after denoising, $\mathrm{Dec}$ represents the decoder and $\hat{z}_0$ represents the latent encoded image in the latent space.
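The recovery pipeline of Equations (15), (16) and (18) can be sketched end to end as follows, assuming a latent-space model with the interfaces named below; all names are illustrative placeholders, and the gradient-based refinement of Equation (17) is indicated only as a comment.

```python
import torch

# A minimal sketch of face identity recovery: noise the latent to depth t0 = lambda*T
# (Equation (15)), denoise it step by step under the key-information guidance K1, K2
# (Equation (16)), then decode the clean latent (Equation (18)). The eps_model and
# decoder interfaces are assumed, not the paper's actual API.

def recover(decoder, eps_model, z_t0: torch.Tensor, K1, K2,
            betas: torch.Tensor, alphas: torch.Tensor,
            alpha_bars: torch.Tensor, lam: float = 0.6) -> torch.Tensor:
    T = betas.numel()
    t0 = int(lam * T)                 # Equation (15): how deep the latent was noised
    z = z_t0
    for t in reversed(range(t0)):     # Equation (16): conditionally guided denoising
        eps = eps_model(z, t, K1, K2)                 # noise term guided by key information
        mean = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        z = mean if t == 0 else mean + torch.sqrt(betas[t]) * torch.randn_like(z)
        # Equation (17) would additionally refine the clean estimate here, e.g.
        # z0_hat = z0_hat - eta * grad(loss(z0_hat)), before continuing the loop.
    return decoder(z)                 # Equation (18): map the latent back to image space
```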