In this section, we discuss related work in two parts: adversarial attacks and adversarial defenses.
2.1. Adversarial Attack
Adversarial attacks try to force deep neural networks (DNNs) to make mistakes by crafting adversarial examples with human-imperceptible perturbations. We denote $x$ as the input of the DNN, $y$ as the label of input $x$, and $f(\cdot)$ as the well-trained DNN. Given $x$ and the network $f$, we can obtain the predicted label of input $x$ through forward propagation; in general, we call $x$ an adversarial example if $f(x) \neq y$. Here, we introduce five mainstream attack methods: FGSM, PGD, DeepFool, JSMA, and CW. They are all typical attack methods ranging from the $\ell_0$ and $\ell_2$ to the $\ell_\infty$ norm.
FGSM: The fast gradient sign method (FGSM) was proposed by Goodfellow et al. [3] and is a single-step attack method. The elements of the imperceptibly small perturbation are equal to the sign of the elements of the gradient of the loss function with respect to the input; therefore, it is a typical $\ell_\infty$-norm attack method. The discovery of the FGSM also showed that the direction of the perturbation, rather than the specific point in space, matters most.
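As a rough illustration, a minimal PyTorch sketch of the FGSM update is given below; the step size eps, the [0, 1] pixel range, and the cross-entropy loss are illustrative assumptions rather than settings used in this paper:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Single-step FGSM: move each input element by eps in the direction
    of the sign of the loss gradient (an l_inf perturbation)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range
```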
PGD: The projected gradient descent (PGD) attack was proposed by Madry et al. [7] and is a multi-step attack method. As in the FGSM [3], it utilizes the gradient of the loss function with respect to the input to guide the generation of adversarial examples. However, the method introduces a random initial perturbation and replaces one big step with several small steps; therefore, it can generate more accurate adversarial examples, but it also has a higher computational complexity.
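A corresponding PGD sketch follows, again with illustrative values for the budget eps, the step size alpha, and the number of steps:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.007, steps=10):
    """Multi-step PGD: random start inside the eps-ball, then repeated small
    signed-gradient steps, each followed by projection back onto the ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Project onto the l_inf eps-ball around x and the valid pixel range.
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```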
JSMA: The Jacobian-based saliency map attack (JSMA) [22] was proposed by Papernot et al. and is a typical $\ell_0$-norm method. It aims to change as few pixels as possible by perturbing the most significant pixels to mislead the model. In this process, the approach updates a saliency map to guide the choice of the most significant pixel at each iteration. For a target class $t$, the saliency map can be calculated by:
$$
S(x, t)[i] =
\begin{cases}
0, & \text{if } \frac{\partial F_t(x)}{\partial x_i} < 0 \ \text{or} \ \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} > 0, \\[4pt]
\frac{\partial F_t(x)}{\partial x_i} \left| \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} \right|, & \text{otherwise},
\end{cases}
$$
where $i$ is a pixel index of the input and $F_j(x)$ is the output of the network for class $j$.
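The following sketch computes this saliency map with automatic differentiation; the per-class gradient loop and the single-sample input shape are simplifying assumptions:

```python
import torch

def jsma_saliency(model, x, target):
    """Compute the targeted JSMA saliency map for a single input x of shape
    (1, ...): large values mark pixels whose increase raises the target class
    score while lowering the scores of all other classes."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                                   # shape (1, num_classes)
    grads = []
    for c in range(logits.shape[1]):
        g = torch.autograd.grad(logits[0, c], x, retain_graph=True)[0]
        grads.append(g.flatten())
    grads = torch.stack(grads)                          # (num_classes, num_pixels)
    dt = grads[target]                                  # dF_t / dx_i
    do = grads.sum(dim=0) - dt                          # sum_{j != t} dF_j / dx_i
    # Zero out pixels that do not help the targeted misclassification.
    return torch.where((dt < 0) | (do > 0), torch.zeros_like(dt), dt * do.abs())
```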
DeepFool: This algorithm was proposed by Moosavi-Dezfooli et al. [23] and is a non-targeted attack method. It aims to find minimal perturbations. The method views the model as a linear function around the original sample and adopts an iterative procedure to estimate the minimal perturbation from the sample to its nearest decision boundary. By moving perpendicularly to the nearest decision boundary at each iteration, it reaches the other side of the classification boundary. Since the DeepFool algorithm can calculate minimal perturbations, it can reliably quantify the robustness of DNNs.
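A simplified multi-class DeepFool iteration, under the usual $\ell_2$ linearization, could look as follows; num_classes, max_iter, and the small overshoot factor are illustrative choices:

```python
import torch

def deepfool(model, x, num_classes=10, max_iter=50, overshoot=0.02):
    """DeepFool sketch: linearize the classifier around the current point and
    step toward the closest decision boundary until the label changes."""
    x_adv = x.clone().detach()
    orig_label = model(x_adv).argmax(dim=1).item()
    for _ in range(max_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)[0]
        if logits.argmax().item() != orig_label:
            break                                       # label flipped: done
        grad_orig = torch.autograd.grad(logits[orig_label], x_adv,
                                        retain_graph=True)[0]
        best_ratio, best_step = None, None
        for k in range(num_classes):
            if k == orig_label:
                continue
            w_k = torch.autograd.grad(logits[k], x_adv,
                                      retain_graph=True)[0] - grad_orig
            f_k = (logits[k] - logits[orig_label]).detach()
            ratio = f_k.abs() / (w_k.norm() + 1e-8)     # distance to boundary k
            if best_ratio is None or ratio < best_ratio:
                best_ratio = ratio
                best_step = ratio * w_k / (w_k.norm() + 1e-8)
        x_adv = (x_adv + (1 + overshoot) * best_step).detach()
    return x_adv.detach()
```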
CW: This refers to a series of attack methods for the $\ell_0$, $\ell_2$, and $\ell_\infty$ distance metrics proposed by Carlini and Wagner [24]. In order to generate strong attacks, they introduced a confidence term to strengthen the attack performance, and to ensure the modification yielded a valid image, they introduced a change of variables to deal with the "box constraint" problem. As a typical optimization-based method, the overall optimization function can be defined as follows:
$$
\min_{\delta} \; D(x, x + \delta) + c \cdot f(x + \delta), \quad \text{s.t.} \ x + \delta \in [0, 1]^{n},
$$
where $c$ is a constant that balances the two terms, $D$ is the distance function, and
$f(\cdot)$ is the cost function, into which the confidence enters as a margin. We adopted the $\ell_2$-norm attack in the following experiments.
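A much simplified sketch of the $\ell_2$ variant is shown below; the constant c, the margin kappa, and the Adam-based optimization schedule are illustrative choices rather than the exact settings of Carlini and Wagner:

```python
import torch

def cw_l2_attack(model, x, y, c=1.0, kappa=0.0, steps=200, lr=0.01):
    """Simplified CW l2 sketch: the tanh change of variables keeps the
    adversarial image inside the [0, 1] box, and the margin kappa plays the
    role of the confidence inside the cost function f."""
    # Optimize in w-space so that x_adv = 0.5 * (tanh(w) + 1) is always valid.
    w = torch.atanh((2 * x - 1).clamp(-0.999999, 0.999999)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, y.unsqueeze(1), float('-inf')).max(dim=1).values
        # f is positive while the true class still beats every other class by kappa.
        f_loss = torch.clamp(true_logit - other_logit + kappa, min=0)
        dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)  # squared l2 distance D
        loss = (dist + c * f_loss).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```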
Furthermore, there are black-box adversarial attack methods. Compared with white-box adversarial attacks, they are harder to carry out or require larger perturbations, and are therefore easier to detect. In this paper, we focus on white-box attacks to test detectors.
2.2. Adversarial Defense
In general, adversarial defenses can be roughly categorized into three classes: (i) improving the robustness of the network, (ii) input modification, and (iii) detection-only defenses that reject adversarial examples.
Methods that aim to build robust models try to classify adversarial examples with the correct label. As an intuitive approach, adversarial training has been extended in many directions, from its original version [3] to fitting on large-scale datasets [25] and to ensemble adversarial training [6]; it is still a strong defense method. Although adversarial training is useful, it is computationally expensive. Papernot et al. [8] proposed defensive distillation, which conceals the gradient information to defend against adversarial examples. Later, Ross et al. [26] argued that defensive distillation could make models more vulnerable to attacks than an undefended model under certain conditions, and proposed to enhance the model with input gradient regularization.
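A minimal sketch of such an input gradient regularization term, assuming a standard cross-entropy objective and an illustrative weight lam, is:

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus a penalty on the gradient of the loss with respect
    to the input, so small input changes cannot change the loss much."""
    x = x.clone().detach().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    # create_graph=True lets the penalty itself be backpropagated during training.
    input_grad = torch.autograd.grad(ce, x, create_graph=True)[0]
    return ce + lam * input_grad.pow(2).sum()
```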
The second line of research is input modification, which modifies the input data to filter or counteract adversarial perturbations. Data compression as a defense method has attracted a lot of attention. Dziugaite et al. [11] studied the effects of JPG compression and observed that it could, to a large extent, reverse the drop in classification accuracy on adversarial images. Das et al. [12] proposed an ensemble JPEG compression method to counteract the perturbations. Although data compression methods achieve a resistance effect to a certain extent, compression also results in a loss of the original information. In [10], the authors proposed a thermometer encoding to defend against adversarial attacks, which ensures no loss of the original information.
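As an illustration of compression-based input modification, the following sketch re-encodes an image with lossy JPEG using Pillow; the quality setting and the [0, 1] float image format are assumptions:

```python
import io
import numpy as np
from PIL import Image

def jpeg_squeeze(image, quality=75):
    """Re-encode an image (H x W x 3 float array in [0, 1]) with lossy JPEG,
    which tends to wash out small adversarial perturbations."""
    pil = Image.fromarray((image * 255).astype(np.uint8))
    buffer = io.BytesIO()
    pil.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.asarray(Image.open(buffer)).astype(np.float32) / 255.0
```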
Detection-only defense is another way to defend against adversarial attacks. We divide these methods into two categories: (i) detecting adversarial examples in the input space with raw data and (ii) using latent features of the models to extract disentangled features. For the first category, Kherchouche et al. [18] proposed to collect natural scene statistics (NSS) from the input space to detect adversarial examples. Grosse et al. [19] proposed to augment the classifier with an additional class into which adversarial examples are classified. Gong et al. [20] constructed a similar method that trains a binary classifier on normal examples and adversarial examples.
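As a toy sketch of this detection-as-classification idea, the snippet below trains a logistic regression (a stand-in for the classifiers used in [19,20]) on randomly generated placeholder features; in practice, the features would be clean images and adversarial images crafted against the target model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_normal = rng.random((1000, 784))   # placeholder for flattened clean examples
X_adv = rng.random((1000, 784))      # placeholder for adversarial examples

X = np.vstack([X_normal, X_adv])
y = np.concatenate([np.zeros(len(X_normal)), np.ones(len(X_adv))])  # 1 = adversarial

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(detector.predict(X[:5]))       # 0/1 decisions for the first five inputs
```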
The second category of adversarial detection methods uses the target model to extract disentangled features to discriminate adversarial examples. Yang et al. [17] observed that the feature attribution map of an adversarial example near the decision boundary is always different from that of the corresponding original example. They proposed to calculate feature attributions from the target model and use the leave-one-out method to measure the differences in feature attributions between adversarial examples and normal examples, and thereby detect adversarial examples. Feinman et al. [21] proposed to detect adversarial examples by kernel density estimates in the hidden layers of a DNN. They trained kernel density (KD) estimates on normal examples of each class; since the probability density values of adversarial examples should be lower than those of normal examples, this forms an adversarial detector.
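A minimal sketch of such class-conditional kernel density scoring with scikit-learn, where the hidden-layer features and the bandwidth are assumed inputs, is:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_class_kdes(features_by_class, bandwidth=1.0):
    """Fit one Gaussian kernel density estimate per class on the hidden-layer
    features of normal training examples."""
    return {c: KernelDensity(bandwidth=bandwidth).fit(f)
            for c, f in features_by_class.items()}

def kd_score(kdes, feature, predicted_class):
    """Log-density of a test feature under the KDE of its predicted class;
    unusually low values suggest an adversarial example."""
    return kdes[predicted_class].score_samples(feature.reshape(1, -1))[0]
```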
Schwinn et al. [27] analyzed the geometry of the loss landscape of neural networks based on saliency maps of the input and proposed a geometric gradient analysis (GGA) to identify out-of-distribution (OOD) and adversarial examples.
Most related to our work, Ma et al. [13] proposed to use the local intrinsic dimensionality (LID) to detect adversarial examples; the estimator of the LID of $x$ is defined as follows:
$$
\widehat{\mathrm{LID}}(x) = -\left( \frac{1}{k} \sum_{i=1}^{k} \log \frac{r_i(x)}{r_k(x)} \right)^{-1},
$$
where $r_i(x)$ denotes the distance between $x$ and its $i$th nearest neighbor in the activation space and $r_k(x)$ is the largest distance among the $k$ nearest neighbors. They calculated the LID value of samples in each layer and trained a logistic regression classifier to discriminate adversarial examples from normal examples. Our method uses the same intuition, that is, we compare the test data with normal data, but we introduce the concept of inner class to limit the comparison scope to the same class label, and unlike the LID, which relies on the Euclidean distance, we use a different basic similarity metric, the cosine similarity.
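For reference, a minimal NumPy sketch of this LID estimator (with an illustrative neighborhood size k) is:

```python
import numpy as np

def lid_estimate(activation, reference_activations, k=20):
    """Maximum-likelihood LID estimate of one sample from the distances to
    its k nearest neighbors in a layer's activation space."""
    dists = np.linalg.norm(reference_activations - activation, axis=1)
    dists = np.sort(dists[dists > 1e-12])[:k]   # r_1(x) <= ... <= r_k(x)
    return -1.0 / np.mean(np.log(dists / dists[-1]))
```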