1. Introduction
Deepfake [1] constructs generator models based on generative adversarial networks (GANs) to forge images. Receiving real images as input, a deepfake model outputs fake images by, for example, changing the hair color. Deepfake has played an important role in the entertainment and culture industries, bringing many conveniences to life and work. However, malicious users may exploit this technology to produce fake videos and news, misleading face recognition systems and seriously disrupting the social order [2,3].
To cope with deepfake tampering, a large number of studies focus on constructing deepfake detection models [4,5,6,7,8,9,10], which can detect whether an image has been faked. However, this technology can only verify the authenticity of an image; it cannot guarantee its integrity. Moreover, even if an image is confirmed to be fake, negative impacts on the people concerned or on society have already occurred, because the image has already been widely circulated. More direct interventions should therefore be taken to ensure, at the source, that images cannot be tampered with through deepfake.
Some studies propose using adversarial attacks [11] to interfere with the operation of deepfake models. The main idea of an adversarial attack is to add a perturbation, imperceptible to the naked eye, to the original example, generating an adversarial example that misleads deep learning models into producing a very different output. Adversarial attacks were originally used to break security systems such as face recognition, posing a huge challenge to the security of deep learning models. However, if the target of an adversarial attack becomes a malicious model such as a deepfake, the meaning of the attack is reversed: it disrupts the normal operation of malicious models in order to guarantee information security. As shown in Figure 1, the specific operation is to add an imperceptible perturbation to an image before users post it online, so that when an attacker obtains the image, the fake version generated through deepfake will have obvious distortions or deformations and can easily be identified as a forgery.
In current studies, however, the generalization of adversarial attacks against deepfake models is very limited: an adversarial example generated for a specific deepfake model cannot produce an equal attack effect on other models [12]; furthermore, even within the same model, an adversarial example generated in a particular domain cannot achieve an effective attack in other domains [13] (by setting the corresponding conditional variables, deepfake models can generate forged images in multiple domains, such as the hair color or the gender of a face image). Without knowing which deepfake model will be employed or which conditional variables will be set to tamper with images, the adversarial attack methods studied so far have great limitations in practice.
To improve the generalization of adversarial attacks, that is, to generate adversarial examples that are effective in every domain of multiple models for given images, this paper proposes a framework of Cross-Domain and Model Adversarial Attack (CDMAA). Any gradient-based adversarial example generation algorithm, such as the I-FGSM [14], can be used within CDMAA. In the backpropagation phase, the algorithm uniformly weights the loss functions of the different condition variables within each model to extend the generalization of the adversarial example across domains. The Multiple Gradient Descent Algorithm (MGDA) [15] is used to compute a weighted combination of the gradients of the different models to ensure the generalization of adversarial examples across models. Finally, we propose a penalty-based gradient regularization method to further improve the success rate of adversarial attacks. CDMAA expands the attack range of the generated adversarial example and ensures that images cannot be tampered with and forged by multiple deepfake models.
2. Related Work
According to the category of model input, some deepfake models take random noise as input and synthesize entirely new images [16], such as ProGAN [17] and StyleGAN [18]. Other deepfake models take real images as input and perform image translation from domain to domain. For example, StarGAN [19], AttGAN [20] and STGAN [21] can translate facial images across domains by setting different conditional variables, such as hair color or age. Unsupervised models such as CycleGAN [22] and U-GAT-IT [23] can only translate images to a single domain, which can be considered a special case of multi-domain translation with a total domain number of 1. This paper focuses on image translation deepfake models and performs adversarial attacks on them to interfere with their normal functions and protect real images from being tampered with.
The adversarial attack was initially applied to classification models [24]. Goodfellow et al. proposed the Fast Gradient Sign Method (FGSM) [25]. The FGSM uses the distance between the model output of the adversarial example and the model output of the original example as the loss function. The gradient of this loss function with respect to the input indicates the direction in which the output difference between the adversarial example and the original example increases fastest, so the FGSM adds this gradient to the original example to generate an effective adversarial example. Kurakin et al. proposed the iterative FGSM (I-FGSM) [14], which performs gradient backpropagation iteratively with a reduced step size for updating the adversarial example, improving its effectiveness. Many studies have since proposed adversarial attack algorithms that optimize the efficiency of adversarial attacks, such as PGD [26], which initializes the adversarial example with random noise, the MI-FGSM [27], which updates the gradient with momentum, and APGD [28], which automatically decreases the step size.
Kos et al. [29] first extended adversarial attacks to generative models. Yeh et al. [30] first proposed attacking deepfake models: they used PGD to generate adversarial examples against CycleGAN, pix2pix [31], etc., which distort the output of these models. Lv et al. [32] proposed giving higher weight to the face region of the image when calculating the loss function, so that the output distortion caused by the adversarial example concentrates on the face, better interfering with deepfake models. Dong et al. [33] explored adversarial attacks on encoder–decoder-based deepfake models and proposed using a loss function defined on the latent variables of the encoder to generate adversarial examples. These studies generate adversarial examples only for particular models and do not take into account that a model can output fake images in different domains by setting different condition variables, so the generalization of these adversarial attacks is quite limited.
Ruiz et al. [13] considered the generalizability of adversarial attacks across different domains. They verified that an adversarial example generated in a particular domain cannot achieve an effective attack in other domains of the same model and proposed two methods, iterative traversal and joint summation, to generate adversarial examples that are effective in every domain. However, they did not consider the generalization of the adversarial examples across different models. Since the differences between models are much larger than the differences between domains within a model, simple iterative traversal or joint summation cannot be equally effective for attacks across models.
Fang et al. [34] considered the generalizability of adversarial attacks across models. They verified that adversarial examples against a particular model are ineffective in attacking other models and proposed weighting the loss functions of multiple models to generate adversarial examples against multiple deepfake models, where the weighting factors are found by a line search. However, the tuning experiments are extremely tedious, because the weighting coefficients have to be searched in a $J$-dimensional parameter space, where $J$ denotes the number of models. In addition, the coefficients need to be re-tuned whenever the set of attacked models changes, which is quite inefficient.
Compared with existing work, this paper focuses on extending the generalization of adversarial examples across domains and models and proposes the CDMAA framework. CDMAA can generate adversarial examples that attack multiple deepfake models under all condition variables with high efficiency.
3. CDMAA
In this paper, we use the I-FGSM as the basic adversarial attack algorithm to introduce the CDMAA framework. In the forward propagation phase, we generate the cross-domain loss function of each model by uniformly weighting the loss functions corresponding to its conditional variables. In the backward propagation phase, we use the MGDA to derive a cross-model perturbation vector from the gradients of the cross-domain loss functions. The perturbation vector is used to iteratively update the adversarial example to improve its generalizability across multiple models and domains.
3.1. I-FGSM Adversarial Attack Deepfake Model
Given an original image $x$, its output from the deepfake model is $G(x, c)$, where $G$ denotes the deepfake model and $c$ denotes the conditional variable. The I-FGSM generates the adversarial example $x^{adv}$ by the following steps:

$$x^{adv}_0 = x, \qquad x^{adv}_{t+1} = \mathrm{clip}_{x,\epsilon}\big(x^{adv}_t + a \cdot \mathrm{sign}(\nabla_{x^{adv}_t} L(x^{adv}_t))\big),$$

where $x^{adv}_t$ denotes the adversarial example after $t$ iterations, $t$ does not exceed the number of iteration steps $T$, $a$ denotes the step size, $\mathrm{sign}(\cdot)$ is the sign function, $\epsilon$ denotes the perturbation range, and the $\mathrm{clip}_{x,\epsilon}$ function restricts the adversarial perturbation so that it does not exceed the perturbation range in the $L_\infty$-norm, i.e., $\|x^{adv} - x\|_\infty \le \epsilon$, so that the difference between the adversarial example and the original image is small enough to ensure that the original image is not visibly modified. $L$ denotes the loss function, which uses the mean squared error (MSE) to measure the distance between the output of the adversarial example $G(x^{adv}, c)$ and the output of the original image $G(x, c)$ [30]:

$$L(x^{adv}) = \frac{1}{D}\,\big\|G(x^{adv}, c) - G(x, c)\big\|_2^2,$$

where $D$ denotes the dimensionality of the model output, i.e., the number of pixel values of the output image.
The adversarial example is updated towards the optimization goal:

$$\max_{x^{adv}} L(x^{adv}) \quad \text{s.t.} \quad \|x^{adv} - x\|_\infty \le \epsilon. \tag{5}$$

The generated adversarial example is considered to have successfully attacked the deepfake model $G$ under the condition variable $c$ when the loss function $L$ keeps increasing and reaches a certain threshold, i.e., when the output image has a sufficiently noticeable distortion.
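For concreteness, the following is a minimal PyTorch sketch of one I-FGSM iteration against a single deepfake generator. The `generator(x, c)` interface, the hyper-parameter defaults and the [0, 1] pixel range are illustrative assumptions rather than the exact models and settings used in the experiments.

```python
import torch

def ifgsm_step(x_adv, x, c, generator, step_size=0.01, eps=0.05):
    """One I-FGSM iteration: move x_adv along the sign of the MSE-loss gradient,
    then clip the perturbation into the eps L-infinity ball around x."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    with torch.no_grad():
        target = generator(x, c)              # output of the original image
    output = generator(x_adv, c)              # output of the current adversarial example
    loss = torch.nn.functional.mse_loss(output, target)
    grad, = torch.autograd.grad(loss, x_adv)
    x_adv = x_adv.detach() + step_size * grad.sign()
    # enforce ||x_adv - x||_inf <= eps and keep pixels in a valid range
    x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```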
3.2. Cross-Domain Adversarial Attack
To extend the generalizability of the adversarial example across the domains of a model, the optimization objective (5) is modified as

$$\max_{x^{adv}} L(x^{adv}; c_i), \quad i = 1, \ldots, K, \quad \text{s.t.} \quad \|x^{adv} - x\|_\infty \le \epsilon, \tag{6}$$

where $c_i$ denotes the $i$th conditional variable of model $G$ and $K$ denotes the total number of conditional variables.
The gradient of each of the loss functions in the above optimization objective is calculated as

$$g_i = \nabla_{x^{adv}_t} L(x^{adv}_t; c_i),$$

where $g_i$ indicates the optimization direction that maximizes the distortion of the output of model $G$ with condition variable $c_i$ for the current adversarial example $x^{adv}_t$.
Since the backbone network of the model is fixed, changing only the condition variable has little impact on the model output, so the loss functions $L(\cdot\,; c_i)$ and their corresponding gradients $g_i$ for different condition variables are very similar, i.e., the $g_i$ have approximately the same direction, as shown in Figure 2a. Therefore, we integrate a cross-domain gradient $g$ by simply uniformly weighting the $g_i$:

$$g = \frac{1}{K}\sum_{i=1}^{K} g_i.$$

$g$ is obtained by integrating the gradients corresponding to each conditional variable, so it indicates a common direction that maximizes the loss function of every domain. Using $g$ to update the adversarial example can achieve the optimization objective of (6).
Consider the following equation:

$$\frac{1}{K}\sum_{i=1}^{K} \nabla_{x^{adv}_t} L(x^{adv}_t; c_i) = \nabla_{x^{adv}_t}\left(\frac{1}{K}\sum_{i=1}^{K} L(x^{adv}_t; c_i)\right).$$

That is, we can uniformly weight the loss functions corresponding to the condition variables to obtain a cross-domain loss function $\bar{L}(x^{adv}) = \frac{1}{K}\sum_{i=1}^{K} L(x^{adv}; c_i)$ and then calculate its gradient with respect to $x^{adv}$, which is exactly the cross-domain gradient $g$. This ensures that only one backpropagation step is performed per model, reducing the time consumption.
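A sketch of the cross-domain gradient computation described above: the per-domain MSE losses are averaged first, so a single backward pass per model yields the cross-domain gradient. The generator interface and the list of condition variables are again placeholders.

```python
import torch

def cross_domain_grad(x_adv, x, generator, conditions):
    """Cross-domain gradient: average the MSE losses over all condition variables
    of one model, then backpropagate once through the averaged loss."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    losses = []
    for c in conditions:
        with torch.no_grad():
            target = generator(x, c)          # output of the original image in domain c
        losses.append(torch.nn.functional.mse_loss(generator(x_adv, c), target))
    cross_domain_loss = torch.stack(losses).mean()   # uniform weighting over the K domains
    grad, = torch.autograd.grad(cross_domain_loss, x_adv)
    return cross_domain_loss.detach(), grad
```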
3.3. Cross-Model Adversarial Attack
We further extend the generalizability of the adversarial example across models, i.e., the optimization objective in (6) is modified:

$$\max_{x^{adv}} \bar{L}_j(x^{adv}), \quad j = 1, \ldots, J, \quad \text{s.t.} \quad \|x^{adv} - x\|_\infty \le \epsilon, \tag{10}$$

where $G_j$ denotes the $j$th deepfake model, $\bar{L}_j$ denotes its cross-domain loss function, and $J$ denotes the total number of deepfake models.
The group of cross-domain gradients has been obtained from Section 3.2:

$$\{g_1, g_2, \ldots, g_J\},$$

where $g_j$ denotes the cross-domain gradient of the $j$th model. Since these gradients come from different models and the differences between models are large, the similarity between the gradients is low, as shown in Figure 2b. Simply traversing these gradients iteratively or weighting them uniformly is prone to large fluctuations in the optimization process and tends to generate ineffective adversarial examples [35].
To derive a cross-model perturbation vector $w$ from the group of gradients $\{g_j\}$ for updating the adversarial example, the CDMAA framework draws on the idea of the Multiple Gradient Descent Algorithm (MGDA) to find $w$:

$$w = \arg\min_{v \in S} \|v\|_2^2, \tag{12}$$

where $S$, the space in which the vector $v$ takes its values, is the convex hull of the gradients:

$$S = \left\{\sum_{j=1}^{J} \alpha_j g_j \;\middle|\; \alpha_j \ge 0,\; \sum_{j=1}^{J} \alpha_j = 1\right\}.$$
Theorem 1. The solution $w$ of (12) is an optimization direction along which the loss function of every model increases for the current adversarial example, i.e., it satisfies:

$$w^{\top} g_j > 0, \quad \forall j \in \{1, \ldots, J\}. \tag{14}$$

Proof. Equation (12) is equivalent to the following optimization problem:

$$\min_{\alpha_1, \ldots, \alpha_J} \left\|\sum_{j=1}^{J} \alpha_j g_j\right\|_2^2 \quad \text{s.t.} \quad \sum_{j=1}^{J} \alpha_j = 1,\; \alpha_j \ge 0.$$

To solve this extreme value problem of a multivariate function under the linear constraint, construct the Lagrange function:

$$F(\alpha, \lambda) = \left\|\sum_{j=1}^{J} \alpha_j g_j\right\|_2^2 - \lambda\left(\sum_{j=1}^{J} \alpha_j - 1\right).$$

According to the Lagrange multiplier method, the equation $\partial F / \partial \alpha_j = 0$ is a necessary condition for $\alpha$ to attain the minimum of $F$; hence, writing $w = \sum_{j=1}^{J} \alpha_j g_j$,

$$\frac{\partial F}{\partial \alpha_j} = 2\, g_j^{\top} w - \lambda = 0 \;\Longrightarrow\; g_j^{\top} w = \frac{\lambda}{2}, \quad \forall j.$$

Multiplying by $\alpha_j$, summing over $j$, and considering that $\sum_j \alpha_j = 1$,

$$\|w\|_2^2 = \sum_{j=1}^{J} \alpha_j\, g_j^{\top} w = \frac{\lambda}{2}, \qquad \text{so} \qquad g_j^{\top} w = \|w\|_2^2, \quad \forall j. \tag{19}$$

In the actual adversarial attack scenario, since the dimension $D$ is much larger than the number $J$ of gradients in $\{g_j\}$, it is almost impossible for these gradients to be linearly dependent. Hence, their linear combination $w \ne 0$ and

$$\|w\|_2^2 > 0. \tag{20}$$

Uniting (19) and (20), there is

$$w^{\top} g_j = \|w\|_2^2 > 0, \quad \forall j.$$

Simultaneously, (19), (20) and (14) are proven. □
Since the inner product of $w$ with the gradient of every model's loss function is positive, updating the adversarial example with $w$ increases the loss functions of all models simultaneously, i.e., it pursues the optimization objective (10), which improves the generalization of the adversarial example across models.
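Theorem 1 can be checked numerically. For two gradients (the $J = 2$ case of (12)) the minimum-norm point of the convex hull has a closed-form interpolation coefficient, used in the sketch below; the random vectors are synthetic stand-ins for real model gradients.

```python
import numpy as np

def min_norm_two(g1, g2):
    """Minimum-norm point on the segment between g1 and g2 (J = 2 case of (12))."""
    diff = g2 - g1
    denom = float(diff @ diff)
    if denom == 0.0:
        return g1.copy()
    gamma = float(np.clip((diff @ g2) / denom, 0.0, 1.0))  # weight on g1
    return gamma * g1 + (1.0 - gamma) * g2

rng = np.random.default_rng(0)
g1, g2 = rng.normal(size=1000), rng.normal(size=1000)      # D >> J, as in Section 3.3
w = min_norm_two(g1, g2)
# w has a positive inner product with both gradients, equal to ||w||^2 (cf. (19), (14))
print(w @ g1 > 0, w @ g2 > 0, np.allclose(w @ g1, w @ w))
```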
3.4. Gradient Regularization
In Section 3.3, if the gradient group $\{g_j\}$ is first regularized as $\{\beta_j g_j\}$ and the MGDA is then applied to the regularized group to find the perturbation vector $w$, the result of (14) still holds, because

$$w^{\top}(\beta_j g_j) = \|w\|_2^2 > 0 \;\text{ and }\; \beta_j > 0 \;\Longrightarrow\; w^{\top} g_j > 0,$$

where $\beta_j > 0$ is the regularization factor.
Common regularization methods include L2 regularization, $\beta_j = 1/\|g_j\|_2$, which scales each gradient to a unit vector, and logarithmic gradient regularization, which reduces each gradient by a factor determined by its corresponding loss function value.
Because the cross-domain gradients computed from different models differ greatly in norm, the resulting vector $w$ tends to be dominated by the gradients with small norms in $\{g_j\}$. In addition, without further constraints or guidance, the generated adversarial example develops an obvious "attack preference" due to the different vulnerability of the deepfake models: it achieves a high attack effect only on the vulnerable models, which eventually leads to a large gap in the loss function values of the different models.
To steer the cross-model perturbation vector $w$ towards improving the attack on the models that are less vulnerable to adversarial attacks, and thus to maximize the attack success rate against all models, we propose a penalty-based gradient regularization:

$$\beta_j = \bar{L}_j(x^{adv}_t) + \xi,$$

where $\bar{L}_j$ denotes the cross-domain loss function of the $j$th model and $\xi$ is a very small positive number that prevents a zero denominator in the per-model gain derived below when $\bar{L}_j = 0$. (The value of the loss function $\bar{L}_j$ is inevitably 0 in the first iteration of the I-FGSM, since the current adversarial example is identical to the original image.)
The significance of using this gradient regularization is as follows.
According to (19), the $w$ derived from the regularized gradient group $\{\beta_j g_j\}$ satisfies

$$w^{\top}(\beta_j g_j) = \|w\|_2^2, \quad \forall j. \tag{26}$$

Consider the first-order Taylor expansion of the loss function $\bar{L}_j$ at the $t$th iteration:

$$\bar{L}_j(x^{adv}_{t+1}) = \bar{L}_j\big(x^{adv}_t + a \cdot \mathrm{sign}(w)\big) \approx \bar{L}_j\big(x^{adv}_t + a \cdot w\big) \approx \bar{L}_j(x^{adv}_t) + a\, w^{\top} g_j, \tag{27}$$

where $g_j$ is evaluated at $x^{adv}_t$, the first approximately equal sign ignores the effect of taking the sign function of $w$ on the result, and the second approximately equal sign ignores the remainder of the first-order Taylor formula.
Uniting (26) and (27), there is

$$\Delta \bar{L}_j = \bar{L}_j(x^{adv}_{t+1}) - \bar{L}_j(x^{adv}_t) \approx a\, w^{\top} g_j = \frac{a\,\|w\|_2^2}{\bar{L}_j(x^{adv}_t) + \xi} \approx \frac{a\,\|w\|_2^2}{\bar{L}_j(x^{adv}_t)}.$$

The last approximate equality holds for a sufficiently small $\xi$. Since $a\,\|w\|_2^2$ is a constant shared by all models, the change of each model's loss function $\Delta \bar{L}_j$ is inversely proportional to its current loss value $\bar{L}_j(x^{adv}_t)$, implying that the smaller the loss, the larger the optimization gain. In practical adversarial attacks, the adversarial example successfully attacks some vulnerable models after only a few iterations, since their loss functions quickly reach the threshold, and further increasing these losses is pointless. Using this regularization makes the adversarial example mainly optimize in the direction of increasing the loss functions that have not yet reached the threshold, which improves the attack effect on the corresponding models and pursues a higher overall attack success rate.
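As a minimal sketch of the regularization step, under the reconstruction above in which each cross-domain gradient is scaled by $\beta_j = \bar{L}_j + \xi$ before the MGDA is applied; the loss values and gradients passed in are placeholders.

```python
import numpy as np

def penalty_regularize(grads, losses, xi=1e-6):
    """Scale each model's cross-domain gradient by (loss + xi), so that the MGDA
    solution favours progress on models whose loss is still small (Section 3.4)."""
    return [(loss_j + xi) * g_j for g_j, loss_j in zip(grads, losses)]

# usage sketch: regularized = penalty_regularize([grad_model_1, grad_model_2], [0.0, 0.12])
```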
3.5. CDMAA Framework
In summary, this paper proposes a framework of adversarial attacks on multiple domains of multiple models simultaneously. Using the I-FGSM adversarial attack algorithm as an example, the procedure of CDMAA is as follows (Algorithm 1):
Algorithm 1 CDMAA
Input: original image $x$, iterative steps $T$, perturbation magnitude $\epsilon$, step size $a$, deepfake model group $\{G_1, \ldots, G_J\}$
Output: adversarial example $x^{adv}$
Initialization: $x^{adv}_0 = x$
1. For $t = 0$ to $T - 1$ do
2.   For $j = 1$ to $J$ do
3.     $\bar{L}_j \leftarrow \frac{1}{K_j}\sum_{i=1}^{K_j} L(x^{adv}_t; G_j, c_i)$   (cross-domain loss; $K_j$ is the number of conditional variables of $G_j$, Section 3.2)
4.     $g_j \leftarrow \nabla_{x^{adv}_t} \bar{L}_j$   (one backpropagation per model)
5.     $g_j \leftarrow \beta_j\, g_j$, with $\beta_j = \bar{L}_j + \xi$   (penalty-based regularization, Section 3.4)
6.   End for
7.   $w \leftarrow \mathrm{MGDA}(\{g_1, \ldots, g_J\})$   (minimum-norm vector in the convex hull, Section 3.3)
8.   $x^{adv}_{t+1} \leftarrow \mathrm{clip}_{x,\epsilon}\big(x^{adv}_t + a \cdot \mathrm{sign}(w)\big)$
9. End for
10. Return $x^{adv} = x^{adv}_T$
Step 3 uses uniform weighting to obtain the cross-domain loss function, which is sufficiently effective because of the similarity of the gradients across domains (Section 3.2). Only one backpropagation per model is then needed to calculate the gradient in Step 4, whereas applying the MGDA across domains would require $K_j$ backpropagation steps per model, so this choice ensures good efficiency.
Step 7 uses the MGDA to obtain the cross-model perturbation vector, where simple uniform weighting is less effective because of the low similarity of the gradients across models (Section 3.3). The MGDA is therefore used to achieve better attacks at the expense of time. We use the Frank–Wolfe method [36] to approximately compute the minimum-norm vector in the convex hull of $\{g_j\}$, which converges well when the number of dimensions is much larger than the number of vectors [35,37].
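A NumPy sketch of a Frank–Wolfe style solver for the minimum-norm point in the convex hull of the $J$ (regularized) cross-domain gradients; the uniform initialization and the iteration count are illustrative choices, not the settings used in the paper.

```python
import numpy as np

def min_norm_point(grads, iters=50):
    """Frank-Wolfe approximation of argmin_{w in conv{g_1..g_J}} ||w||_2."""
    G = np.stack([np.ravel(g) for g in grads])        # (J, D) matrix of flattened gradients
    alpha = np.full(len(G), 1.0 / len(G))             # start from the uniform combination
    for _ in range(iters):
        w = alpha @ G                                 # current point in the convex hull
        j = int(np.argmin(G @ w))                     # vertex minimizing the linearized objective
        d = G[j] - w                                  # Frank-Wolfe direction
        denom = float(d @ d)
        if denom < 1e-12:
            break
        gamma = float(np.clip(-(w @ d) / denom, 0.0, 1.0))  # exact line search on ||w + gamma*d||^2
        alpha = (1.0 - gamma) * alpha
        alpha[j] += gamma
    return alpha @ G
```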
Figure 3 shows an overview of CDMAA. Note that CDMAA is not tied to the I-FGSM, although this paper uses the I-FGSM to introduce it. The main idea of CDMAA is to obtain a perturbation vector from the gradients of multiple domains and models and then use it to update the adversarial example, ensuring its ability to attack multiple models and domains. Therefore, CDMAA can be applied to any gradient-based adversarial attack algorithm, such as the MI-FGSM and APGD.
4. Experiment and Analysis
To verify the effectiveness of the proposed CDMAA framework, we conduct adversarial attack experiments against deepfake models and analyze the results. Section 4.1 introduces the deepfake models, hyper-parameters and evaluation criteria used in the experiments. Section 4.2 uses CDMAA to attack four models at the same time and reports the results of the adversarial attacks. Section 4.3 presents ablation experiments that show the impact of each CDMAA component on the attack and compares CDMAA with the methods used in existing work.
4.1. Deepfake Models, Hyper-Parameters and Evaluation Metrics
We prepared four deepfake models, StarGAN, AttGAN, STGAN and U-GAT-IT, for the adversarial attack experiments; these models are also used in similar existing work [12,13,34]. StarGAN and AttGAN adopt the officially provided pre-trained models, which are trained on the celebA dataset in five domains (black hair, blonde hair, brown hair, gender and age) and in 13 domains (such as bald head and bangs), respectively. STGAN uses the model trained on the celebA dataset in five domains (bangs, glasses, beard, slightly opened mouth and pale skin), which are rare attributes in the original images. We selected these domains to avoid cases in which the STGAN output is the same as the input because the original picture already contains the attribute of the corresponding domain; in such cases the experimental results would be affected, since the model cannot effectively forge the image even without adversarial examples [34]. U-GAT-IT translates images from a single domain to another, so it can be regarded as a special case of multi-domain deepfake with a total number of conditional variables $K = 1$. To unify the dataset across experiments, we used a U-GAT-IT model trained with the official code to translate from celebA to an anime dataset.
The adversarial attack algorithm is the I-FGSM, whose hyper-parameters follow the settings in existing work, except where noted. The test data are $N$ pictures randomly sampled from the celebA dataset (the same pictures are used in every contrast experiment).
The value of the loss function $L(x^{adv}_n; G_j, c_i)$ is used to quantify the output distortion caused by the $n$th adversarial example $x^{adv}_n$ on model $G_j$ under the condition variable $c_i$. The following evaluation criteria [13] are used to evaluate the effectiveness of the adversarial attack:

$$\textit{average } L_j = \frac{1}{N K_j}\sum_{n=1}^{N}\sum_{i=1}^{K_j} L(x^{adv}_n; G_j, c_i),$$

$$\textit{attack\_rate}_j = \frac{1}{N K_j}\sum_{n=1}^{N}\sum_{i=1}^{K_j} \mathbb{1}\big[L(x^{adv}_n; G_j, c_i) > \text{threshold}\big],$$

where average $L_j$ represents the average value of the loss function of the $N$ adversarial examples over all domains of model $G_j$, and attack_rate$_j$ represents the proportion of loss values of the $N$ adversarial examples over all domains of model $G_j$ that reach the threshold [13], i.e., the attack success rate.
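Both metrics can be computed from a matrix of per-example, per-domain loss values, as in the sketch below; `threshold` stands for the success threshold taken from [13], whose value is left as a parameter.

```python
import numpy as np

def evaluate(loss_matrix, threshold):
    """loss_matrix: array of shape (N, K_j) holding the loss of each adversarial
    example (rows) in each domain of model G_j (columns)."""
    average_L = float(loss_matrix.mean())                  # mean loss over examples and domains
    attack_rate = float((loss_matrix > threshold).mean())  # fraction of losses above the threshold
    return average_L, attack_rate
```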
4.2. CDMAA Adversarial Attack Experiment
We used the CDMAA framework to attack the four deepfake models StarGAN, AttGAN, STGAN and U-GAT-IT simultaneously. The results are shown in Table 1:
The results show that the generated adversarial examples have a clear effect on every domain of the four deepfake models. StarGAN and U-GAT-IT are relatively vulnerable to adversarial attacks: their average L values are much greater than the threshold and their attack success rates are close to 100%. The success rates against AttGAN and STGAN are relatively lower, i.e., these two models are less affected by the adversarial attack.
In addition, comparing experiments (a), (b) and (c), we see that the attack effect can be improved by relaxing the algorithm parameters, such as increasing $\epsilon$ or $T$ (at the expense of a more visible perturbation or a larger computational cost). Comparing experiments (a), (d) and (e), we find that a stronger adversarial attack algorithm also improves the attack effect (the MI-FGSM improves on the I-FGSM, and APGD improves on the MI-FGSM). Both the MI-FGSM and APGD perform $J$ gradient backpropagations in each iteration, the same as the I-FGSM, so they have the same time complexity $O(T)$ and roughly similar computational cost. All of this shows that the CDMAA framework is well compatible with different adversarial attack algorithms and that general improvements of those algorithms also apply to CDMAA.
Figure 4 shows the attack effectiveness of some of the adversarial examples in the above experiment:
Figure 4 shows that the difference between the adversarial example and the original image is so small that the human eye can hardly distinguish them, whereas the difference between the corresponding outputs of the deepfake model, i.e., the distortion of the fake image, is large enough to be noticed. Therefore, using the adversarial example instead of the original image significantly deforms the output of the deepfake model and effectively prevents the model from forging pictures.
4.3. Ablation/Contrast Experiments
4.3.1. Cross-Domain Gradient Ablation/Contrast Experiment
To verify that the uniformly weighted cross-domain gradient used by CDMAA effectively expands the generalization of adversarial examples across domains, we carry out a contrast attack experiment in which the other components of CDMAA are kept unchanged and only the way the per-domain gradients are combined is varied: (1) Single gradient: use the gradient of only one domain as the cross-domain gradient, i.e., ignore the generalization of the generated adversarial examples to other domains; this is the setting in most current studies [30,32,33]. (2) Iterative gradient: iteratively use the gradient of each domain's loss function as the cross-domain gradient [13]. (3) MGDA: apply the MGDA to the per-domain gradients to generate the cross-domain gradient. The results are shown in Table 2:
Figure 5 shows a visual comparison of the results. Compared with existing research on adversarial attacks against deepfake, which uses only a single domain's gradient or iterates over the domain gradients, the CDMAA framework, which generates the cross-domain gradient by uniform weighting, achieves a higher attack success rate against each model, especially the models with more domains, such as AttGAN, and effectively increases the generalization of the adversarial examples across domains. The average L of some models under the single-gradient or iterative-gradient method can exceed that of CDMAA, which indicates that the effectiveness of the adversarial examples generated by these two methods varies greatly across domains and is not as well balanced and stable as that of CDMAA. In addition, compared with using the MGDA to generate the cross-domain gradient, simple uniform weighting is almost as effective but greatly reduces the time consumption (Section 3.5). Therefore, CDMAA uses the more efficient uniform weighting method to calculate the cross-domain gradient.
4.3.2. Cross-Model Perturbation Ablation/Contrast Experiment
To verify that using the MGDA to calculate the cross-model perturbation vector $w$ effectively expands the generalization of adversarial examples across models, we carry out a contrast attack experiment in which the other components of CDMAA are kept unchanged and only the way the cross-domain gradients of the models are combined is varied: (1) Single gradient: use the cross-domain gradient of only one model to update the adversarial example, which is equivalent to ignoring whether the adversarial example can attack the other models. (2) Iterative gradient: iteratively use the cross-domain gradient of each model to update the adversarial example [12]. (3) Uniform weighting: use the mean of the cross-domain gradients of all models to update the adversarial example [34]. The results are shown in Table 3:
Figure 6 shows a visual comparison of the results. Although the single-gradient, iterative-gradient and uniform-weighting methods used in current research can reach a high average L value on some models, such as StarGAN, their effectiveness on models that are robust to adversarial attacks (such as AttGAN and STGAN) is very poor. In fact, reaching such a high average L value is pointless: as long as the threshold is exceeded, the output distortion is obvious enough and the adversarial attack succeeds. In contrast, the MGDA group achieves a considerable attack success rate on every model. Therefore, we use the MGDA to calculate the cross-model perturbation and improve the generalization of adversarial examples across models.
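For reference, the single-gradient, iterative-gradient and uniform-weighting baselines compared in Tables 2 and 3 amount to the following combination rules (a sketch; `grads` stands for the list of per-domain or per-model gradients, and the MGDA variant would instead call a minimum-norm solver such as the one sketched in Section 3.5).

```python
import numpy as np

def single_gradient(grads):
    """Use only one fixed gradient, ignoring the other domains or models."""
    return grads[0]

def iterative_gradient(grads, t):
    """Cycle through the gradients over the iterations t = 0, 1, 2, ..."""
    return grads[t % len(grads)]

def uniform_weighting(grads):
    """Average all gradients with equal weights."""
    return np.mean(np.stack(grads), axis=0)
```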
4.3.3. Gradient Regularization Ablation/Contrast Experiment
To verify the effectiveness of the gradient regularization used in CDMAA, we carry out a contrast attack experiment in which the other CDMAA components are kept unchanged and only the gradient regularization method is varied: (1) Without regularization: $\beta_j = 1$, i.e., the regularization factor is always 1, which is equivalent to not using regularization. (2) L2 regularization: $\beta_j = 1/\|g_j\|_2$. (3) Logarithmic gradient regularization [15]. The results are shown in Table 4:
Figure 7 shows a visual comparison of the results. On the attack_rate metric, the penalty-based gradient regularization is superior to the other gradient regularizations. It achieves a more uniform distribution of the attack effect over the models by giving up some gain on the models with large loss values in exchange for a stronger attack on the models with small loss values. In practical adversarial attacks, where the models differ greatly in their vulnerability, this gradient regularization method is therefore more practical.
5. Conclusions and Future Work
In this paper, we propose CDMAA, a framework for adversarial attacks against deepfake models that expands the generalization of the generated adversarial examples to every domain of multiple models. Specifically, the adversarial examples generated by CDMAA distort the fake images output by multiple deepfake models under any condition variable, thereby interfering with the deepfake models and protecting pictures from tampering. Adversarial attack experiments on four mainstream deepfake models show that the adversarial examples generated by CDMAA have high attack success rates and can effectively attack multiple deepfake models at the same time. Through ablation experiments, we verify, on the one hand, the effectiveness of each CDMAA component and, on the other hand, the superiority of CDMAA over the methods of similar research.
Since CDMAA relies on gradient-based adversarial attack algorithms, future work can focus on extending the framework to attack algorithms that do not require gradients, such as AdvGAN [38] or the Boundary Attack [39]. In addition, we will try to extend CDMAA to attack deepfakes of other data types, such as video and voice.