Article

A Local Adversarial Attack with a Maximum Aggregated Region Sparseness Strategy for 3D Objects

1
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
2
Hunan Provincial Institute of Land and Resources Planning, Hunan Key Laboratory of Land Resources Evaluation and Utilization, Changsha 410083, China
3
Department of Precision Instrument, Tsinghua University, Beijing 100084, China
*
Author to whom correspondence should be addressed.
J. Imaging 2025, 11(1), 25; https://doi.org/10.3390/jimaging11010025
Submission received: 8 December 2024 / Revised: 7 January 2025 / Accepted: 9 January 2025 / Published: 13 January 2025

Abstract

The increasing reliance on deep neural network-based object detection models in various applications has raised significant security concerns due to their vulnerability to adversarial attacks. Existing adversarial attacks that target object detection in physical 3D environments (3D-AE) typically require large and dispersed modifications to objects; such camouflages are conspicuous, easily noticed, and therefore less effective in real-world scenarios. The core issue is how to maximize the attack effect with minimal, concentrated camouflage. To address this, this paper proposes a local 3D attack method driven by a Maximum Aggregated Region Sparseness (MARS) strategy, which concentrates the adversarial modifications in a few critical areas to enhance effectiveness while maintaining stealth. To maximize the aggregation of attack-camouflaged regions, an aggregation regularization term is designed to constrain the mask aggregation matrix based on the face-adjacency relationships. To minimize the attack camouflage regions, a sparseness regularization is designed to make the mask weights tend toward a U-shaped distribution and to limit extreme values. Additionally, neural rendering is used to obtain gradient-propagating multi-angle augmented data and to suppress the model's detections, locating universal critical decision regions that remain effective across viewpoints and conditions. We test the attack effectiveness of different region selection strategies. On the CARLA dataset, the average attack efficiency against the YOLOv3 and YOLOv5 series networks reaches 1.724, an improvement of 0.986 (134%) over the baseline methods, which demonstrates that our attack achieves both stealth and aggressiveness from different viewpoints and highlights the potential risks to real-world object detection systems. Furthermore, we explore the transferability of the decision regions: our method can be effectively combined with different texture optimization methods, reducing the average precision by 0.488 and 0.662 across different networks, which indicates strong attack effectiveness.

1. Introduction

In recent years, the rapid advancement of deep learning in computer vision has led to the widespread application of convolutional neural networks (CNNs) in 3D object detection and recognition [1,2,3,4]. Recent studies have demonstrated that the recognition capabilities of neural networks are limited and susceptible to minor perturbations in datasets, which result in erroneous predictions [1,5,6]. This phenomenon, known as an adversarial attack, involves perturbed samples called adversarial examples. These adversarial examples can deceive neural networks, including those in various security systems, and pose significant threats to their safety and effectiveness in real-world applications [7,8,9].
Research on adversarial examples has predominantly focused on traditional 2D image adversarial attacks by modifying certain pixels in the digital realm [3,10,11]. In practical applications, it is essential to consider how to translate adversarial attacks from the digital domain to physical-world adversarial disguises. Current methods to generate adversarial patches can be categorized into two types based on the training form: (1) 2D Adversarial Patches: These are generated from 2D images and printed or affixed to real-world objects [12,13,14,15]. Most of these methods are based on gradient optimization techniques such as FGSM [16,17,18], C&W [19], particle swarm optimization (PSO) [20], and reinforcement learning (RL) [21]; (2) 3D Adversarial Examples: These directly modify the shape or texture of 3D objects to achieve adversarial attacks [22,23]. Additionally, 2D image-based adversarial attacks struggle to adapt to the complexities of the real world and are generally less effective in the 3D domain than attacks based on 3D objects. These methods have been applied in various security scenarios such as facial recognition and vehicle identification [24,25,26].
However, three main issues arose from the aforementioned works:
1
The 2D-3D gradient transfer problem: adversarial-object methods [27,28,29] introduce non-differentiable optimization parameters, which makes it challenging to apply gradient-based optimization techniques effectively and leads to suboptimal attack performance and a reduced ability to fine-tune adversarial perturbations for enhanced stealthiness.
2
Adapting to multi-angle transformations is difficult: Optical control is used to create an optical effect [30,31,32], so the adversarial attacks are only valid at certain angles. As a result, these attacks lack robustness when the object is viewed from different perspectives, limiting their practicality in dynamic real-world environments where the viewing angles can vary unpredictably.
3
To balance the visual stealth and attack effectiveness, the texture or shape of the target is modified [33,34,35], which results in camouflages that may be too large to be hidden by artificial vision or may be small but have less attack effectiveness. This trade-off often forces a compromise where either the camouflage becomes easily noticeable, thereby compromising stealth, or it remains inconspicuous but fails to achieve the desired level of attack effectiveness, undermining the overall goal of the adversarial attack.
These limitations highlight the need for more sophisticated attack strategies that can maintain high levels of stealth while ensuring robust and effective adversarial performance across various real-world conditions. Thus, a new method that can comprehensively handle these challenges is needed.
In this paper, we propose a local adversarial attack with a Maximum Aggregated Region Sparseness (MARS) strategy. To address the issues of 3D-2D gradient propagation in the physical world and complex environmental factors, we employ a neural rendering method to obtain differentiable multi-angle augmented data. To balance the attack efficiency and visual stealth, we maximize the aggregation and minimize the attack camouflage under the guidance of object detection. To maximize the aggregation of attack camouflage regions, an aggregation regularization term is designed to constrain the mask aggregation matrix based on the face-adjacency relationships. To minimize the attack camouflage regions, sparseness regularization is introduced, which makes the mask weights tend toward a U-shaped distribution and limits extreme values. By suppressing model detection to locate universal critical decision regions from multiple angles, texture modifications in these regions achieve universal local camouflage. The main contributions of this paper are as follows:
1
We propose a local 3D attack framework driven by the Maximum Aggregated Region Sparseness (MARS) strategy, which ensures visual stealth while achieving efficient attacks.
2
We design an optimization regularization specifically for 3D mesh faces, including aggregation regularization to maximize the aggregation of attack camouflage regions and sparseness regularization to minimize the attack camouflage regions. This approach effectively adapts to the characteristics of deep neural networks and identifies critical decision regions on 3D objects.
3
Compared with the baseline methods, the MARS attack strategy achieves an average attack efficiency (AE) of 1.724, i.e., an improvement of 0.986 (134%). Additionally, MARS effectively attacks vehicles from different viewpoints, so it is robust to angle variations. Transferability experiments indicate that MARS exhibits strong attack performance in black-box environments.
Our paper is organized as follows: Section 2 illustrates the related work, and Section 3 describes the details of our proposed method. The experiments and conclusions are presented in Section 4 and Section 5.

2. Related Work

In this section, we introduce the recent exploration of physical adversarial examples and rendering.

2.1. Physical Adversarial Attack

Recently, the rapid advancement of deep learning in the domain of computer vision has significantly transformed the automated understanding of realistic images [36,37,38,39,40], and it has found widespread application across diverse professional domains [41,42,43,44,45]. Nevertheless, research into adversarial attacks on realistic images for deep learning is also progressing [46,47]. Adversarial attack algorithms can be divided into digital attacks and physical attacks based on their implementation domains. In physical scenarios, numerous factors such as lighting and camera parameters affect the adversarial attack and make its implementation more complex. Currently, physical adversarial attacks can be classified into three categories based on the attack method: adversarial objects, optical adversarial attacks, and attribute modifications of the target object.
1. Adversarial Objects: This approach involves placing objects with specific shapes and textures on or near the target object. For example, Tsai et al. [48] implemented robust adversarial objects in both digital and physical worlds using point cloud data for PointNet++. Lee et al. [49] proposed the first practical method to generate 3D adversarial point clouds. Liu et al. [50] extended this field with additional attack methods and metrics. Kotuliak et al. [51] used GANs to create unrestricted false-positive adversarial examples. Cao et al. [52] introduced LIDAR-ADV, which generates adversarial objects that evade LIDAR-based detection systems under various conditions, to reveal potential vulnerabilities in autonomous driving detection systems. Although these methods are easy to implement in the physical world, they often lack differentiable optimization and have poor attack effectiveness. They can also be easily detected, removed, or neutralized by simple defense mechanisms such as random sampling and outlier removal, as described by Liu et al. [50]. In contrast, our method enhances both the effectiveness and stealthiness of attacks by maximizing the sparsity of aggregated regions. This allows for stronger attacks within smaller areas, overcoming the limitations of existing approaches.
2. Optical Adversarial Attacks: These attacks involve manipulating light using optical instruments and can be categorized based on the stage of the imaging process. Light capture manipulation: Hu et al. [53] proposed AdvZL, which successfully attacks autonomous driving tasks using zoom lenses. Translucent patches [54] use transparent stickers to cover the camera lens and hide the target. Light source manipulation: Li et al. [55] used adversarial illumination to attack structured-light-based 3D face recognition, whereas AdvSL [56] used spotlights for flexible attacks in complex environments. Shadow creation: Zhong et al. [57] created shadows on traffic signs to produce naturalistic and stealthy optical adversarial examples. Wang et al. [58] proposed Reflective Light Attack (RFLA), which is effective in bright environments. These methods struggle to adapt to the complexities of the physical world and are only effective under specific conditions. In contrast, our approach leverages neural rendering techniques to enable gradient propagation across multiple angles, ensuring that attacks remain effective and stealthy from various viewpoints. This significantly enhances adaptability in complex environments.
3. Attribute Modifications of the Target Object: This category includes modifying the shape of the target [29], which is challenging to reproduce in the physical world, and modifying the texture of the target, which is performed in this paper. Texture modification is differentiable in the digital world and easy to reproduce in the physical world. Numerous studies have focused on physical adversarial attacks with different texture coverages. Full object coverage: Pautov et al. [18] projected adversarial patterns onto a mesh using a grid generator to simulate nonlinear transformations for facial recognition. Komkov et al. [16] used a spatial transformation layer (STL) to project perturbations onto a hat. Partial object coverage: Wei et al. [59] optimized the position and rotation angle of cartoon stickers to evade facial recognition systems, which made this method easier to implement and more threatening. DAS [60] combines meaningful and meaningless patches using smiley outlines (to attract attention) and unrecognizable patterns (to deceive DNNs). However, these methods often face a trade-off between coverage and attack effectiveness, making it difficult to achieve strong attacks while maintaining high stealthiness. In contrast, our proposed MARS strategy effectively achieves powerful attacks within limited regions by maximizing the sparsity of aggregated areas, balancing both stealth and attack strength.
In summary, due to safety and cost constraints, the existing adversarial camouflage methods often require large areas to cover the surface of an object, which results in visually obvious full object coverage attacks or inefficient partial attacks that only work on a limited number of angles. In contrast, our method employs the Maximum Aggregated Region Sparseness strategy to achieve more efficient attacks within smaller regions while maintaining effectiveness across multiple viewpoints. This significantly enhances both the practicality and stealthiness of the attacks. Therefore, this paper aims to find a method that balances the attack area and attack effectiveness.

2.2. Rendering in Adversarial Examples

As interest in physical adversarial attacks has grown, rendering in adversarial examples has also garnered the attention of researchers. The goal of rendering is to simulate 2D images of 3D objects under various environmental conditions. Traditional rendering [61,62] involves providing a series of inputs such as geometry, lighting, materials, and camera positions to rasterization or ray-tracing renderers. These renderers output images after a series of pipeline operations. The main challenge with these methods is the propagation of gradient information before and after rendering. To ensure that the rendering process is differentiable, the concept of neural rendering has been introduced. A neural renderer can be considered a universal function approximator that uses neural networks to transform various parameters into output images. Existing neural rendering methods primarily use differentiable renderers [63,64,65] to simulate rendering processes. For example, Zhang et al. [14] utilized the expectation over transformation (EOT) principle that Athalye et al. [66] proposed to train neural networks for approximate differentiation and simulate the imaging process. STA [67] analyzed the vulnerabilities of Siamese networks in visual tracking and proposed a differentiable rendering pipeline to generate perturbed texture maps for 3D objects, which successfully reduced the tracking accuracy and caused a tracker drift. Complex mapping-based neural rendering, which is used in this paper, facilitates gradient-based optimization. For example, the Neural Mesh Renderer [68] in MeshAdv [69] quickly integrates rendering into neural networks. Yang et al. [70] experimented with various renderers [71,72,73,74] and analyzed their transferability in adversarial attacks to better understand adversarial attacks on 3D objects in the real world.
Currently, most renderers struggle to efficiently propagate gradients from 2D pixels to 3D mesh faces in the model recognition results. This limitation severely affects the generation of adversarial attack models based on 3D targets. In contrast, our proposed rendering method optimizes the gradient transfer process, significantly enhancing the efficiency and effectiveness of generating adversarial attack models for 3D targets. This provides robust support for efficient physical adversarial attacks. Therefore, this paper presents an efficient 3D-2D rendering method that facilitates fast gradient transfer.

3. Method

In this section, the details of the proposed method are presented, including the regularization for attack patch aggregation, the regularization for attack camouflage minimization, and the forward and backward propagation between 3D objects and 2D images. The pseudocode for our algorithm is also presented.

3.1. Framework

The Maximum Aggregated Region Sparseness (MARS) strategy is designed to enhance the stealth and effectiveness of local adversarial attacks on 3D objects. MARS operates by identifying critical decision regions on the target surface and modifying them in a way that minimizes the visibility of the perturbations while maximizing their impact on the detection model. Based on the imaging principles of optical remote sensing and to address the optimization challenges in these critical regions, our framework is divided into the following components: (1) To maximize the aggregation of attack camouflage regions, an aggregation regularization term is designed to constrain the mask aggregation matrix based on face-adjacency relationships. (2) To minimize the attack camouflage regions, sparseness regularization is introduced, which drives the mask weights toward a U-shaped distribution and limits extreme values. (3) To facilitate gradient propagation between 3D objects and 2D images, the camouflage is set as a fixed perturbation to obscure the original texture information, and a neural rendering algorithm introduces mask weights that control the proportion of camouflage to original texture while keeping the optimization parameters differentiable. To achieve efficient attacks from multiple angles and distances, the renderer outputs many target images with varied camera parameters, $O = R(M, T; \theta_c)$, which are combined with the real background $G$ to produce realistic images $I = O + G$. By minimizing the target confidence loss under the aggregation and sparseness regularization constraints, the framework learns the shape and position of patches on the target surface. This process identifies the critical decision regions on the target surface and enables effective attacks by altering only small portions of these regions. The framework of MARS is shown in Figure 1.
Overview of MARS: The MARS framework operates by balancing two key constraints: aggregation and sparsity. The aggregation constraint ensures that the adversarial perturbations are concentrated in specific, critical regions of the object, enhancing the attack’s effectiveness. The sparsity constraint limits the number of regions being modified, maintaining the stealthiness of the attack by preventing large, noticeable changes. Additionally, the use of a U-shaped weight distribution encourages the mask weights to adopt extreme values (close to 0 or 1), which simplifies the perturbation pattern and makes it less detectable. The process diagrams for each part are shown in Figure 1.

3.2. Generating Adversarial Camouflage Regions

3.2.1. Regularization for Attack Camouflage Aggregation

Observations during training reveal that without special constraints, the resulting decision regions may become fragmented, which leads to poor attack performance after the texture iteration. Considering the recognition patterns of deep neural networks, the obtained decision regions must be as aggregated as possible. This aggregation facilitates the creation of complete patterns in subsequent iterations and enables successful attacks by modifying only small areas.
To define the degree of aggregation of decision regions, we calculated the sum of mask weights for each face and its adjacent faces as an influencing factor for that face. This parameter is incorporated into the loss function. The iterative computation involves multiple adjacency results: in each iteration, the outermost faces are removed to obtain the weight values of the inner faces, and the process continues until the weight of the central-most face is determined; then, the iteration is stopped. The weight calculation of each face in every iteration considers directly adjacent faces and the neighbors of those adjacent faces (Figure 2).
There are three known adjacency situations for face patches: independent (0 adjacent faces), 1 adjacent face, and 2 adjacent faces; therefore, the number of adjacent faces of edge patches must be ≥2. The list is initialized as the exterior face list, where each face is assigned a weight (face_weight) as a transparency value to control the surface texture. In each iteration, the core of each face in the list is calculated as $core = \sum face\_if \cdot face\_weight$, summed over the face and its adjacent faces. After the calculation, the new core value replaces the original weight value. At the end of a single iteration, faces with fewer than 2 adjacent faces are removed from the list. The iteration is repeated until the list length is ≤1. The final aggregation loss is calculated as follows:
$L_{agg} = \sum_{f \in list_{exterior\_face}} core_f$  (1)
The aggregation matrix, as shown in Figure 3, indicates that the influence factor of each face on the core ranges from 0 to 1. Since the aggregation results replace the original weights during our calculations, we must ensure that the sum of the influence factors of adjacent faces exceeds 1. Based on experiments and empirical data, we set the influence factors according to the distance of each face from the core (i.e., the number of edges apart), which is calculated as follows:
$face_{if} = MaxN / (2^{distance} \cdot num)$  (2)
where $MaxN$ is the maximum number of adjacencies possible in this form, used to balance the size of the influence factors of adjacent faces; $distance$ is the number of edges from the core; and $num$ is the number of faces in a given adjacency form, such as three faces in the first layer of edge-to-edge adjacency and six faces in the second layer of edge-to-edge-to-edge adjacency.
The aggregation regularization encourages the adversarial patches to form contiguous regions rather than scattered points. By considering the adjacency relationships, the model ensures that modifications are concentrated in clusters, which makes the perturbations more effective and less detectable.
Since all parameters in this regularization are differentiable for each face, the gradient propagation in subsequent stages is not affected, which ensures that this optimization method is feasible. The goal of this part is to minimize $L_{agg}$, aggregate the important faces as much as possible in the optimization results, and consequently reduce the number of optimized regions.
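As an illustration, the peeling procedure described above can be sketched in PyTorch as follows. This is a minimal sketch, not the exact implementation: only the first adjacency layer of Equation (2) is applied (the paper also uses the second layer), and the adjacency-list input, the helper names, and the MaxN value are assumptions.

```python
import torch

def aggregation_loss(face_weight: torch.Tensor, adjacency: list[list[int]],
                     max_n: int = 12) -> torch.Tensor:
    """Sketch of L_agg (Eq. (1)): peel exterior faces and accumulate 'core' values.

    face_weight: (F,) differentiable mask weights in [0, 1].
    adjacency:   adjacency[i] lists the faces that share an edge with face i.
    max_n:       MaxN in Eq. (2) (an assumed value).
    """
    weights = face_weight.clone()
    active = set(range(len(adjacency)))            # current exterior-face list
    total = face_weight.new_zeros(())

    while len(active) > 1:
        new_weights = weights.clone()
        for i in active:
            neighbours = [j for j in adjacency[i] if j in active]
            num = max(len(neighbours), 1)
            face_if = max_n / (2 ** 1 * num)       # Eq. (2), first layer only (distance = 1)
            core = face_if * weights[[i] + neighbours].sum()
            new_weights[i] = core                  # the core replaces the old weight
            total = total + core
        weights = new_weights
        # peel: drop faces that now have fewer than 2 active neighbours
        peeled = {i for i in active
                  if sum(j in active for j in adjacency[i]) < 2}
        if not peeled:                             # nothing left to peel; stop
            break
        active -= peeled
    return total
```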

3.2.2. Regularization for the Attack Camouflage Minimization

The purpose of this regularization is to keep the attack camouflage area within a certain range during training and thereby accelerate the optimization process. To achieve a balanced distribution of important regions, relying only on the other losses may produce persistently monotonic gradients, which are insufficient to balance the area against the adversarial effectiveness. Therefore, the attack camouflage minimization must be incorporated into the training process.
We observe that during training, many faces with uniformly distributed mask weights occupy large areas but hardly contribute to the attack effectiveness. We refer to this phenomenon as “mask uniform distribution.” These masks achieve attacks through a specific set of camouflages with varying transparencies, which contradicts our intention of using fixed perturbations and fails to obscure texture features. The identified regions lack decisiveness and transferability, as shown in Figure 4.
To prevent the adversarial patches from being too spread out and noticeable, we introduce a sparsity constraint. This encourages the mask weights to adopt a U-shaped distribution, where most weights are pushed towards 0 or 1. Such a distribution ensures that only specific regions are heavily modified (weights near 1), while the rest remain largely unchanged (weights near 0), enhancing stealthiness and ensuring that perturbations are concentrated where they are most effective.
To address this issue, we use the Mean Square Error (MSE) loss to constrain the weight distribution and encourage the weights to trend toward 0 or 1 during training. This step prevents certain combinations of patterns from impacting the detection results. The MSE loss is calculated as follows:
$L_{mse}(H(M), I) = \sum_{i=1}^{m} \big(H(M_i) - 1\big)^2 / m$  (3)
where $I$ is an all-ones matrix, $M$ is the mask weight matrix, $m$ is the number of faces, and $H(M)$ is the matrix form of the array used for the calculation.
The attack camouflage minimization is divided into two parts: the first restricts the mask weight of each face to approach 0 or 1, and the second limits the number of faces with weights approaching 1 in order to select important regions. The mask weight that defines the minimization represents the opacity of the fixed perturbation on each face (0 indicates no perturbation, and 1 indicates a fully visible perturbation). The attack camouflage region minimization is defined as the sum of power functions of the face weights; to maintain consistency with the MSE loss, the L2 norm is used as the sparsity constraint. The sparseness loss is calculated as follows:
$L_{sparse} = L_{mse}(H(M), I) + \alpha \lVert M \rVert_2$  (4)
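As a small illustration, Equations (3) and (4) can be written compactly in PyTorch. This is a sketch under our own naming; the value of α used here is an assumed placeholder, not the paper's setting.

```python
import torch

def sparseness_loss(mask_weight: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Sketch of L_sparse (Eq. (4)) for a (F,) tensor of per-face mask weights.

    The MSE term (Eq. (3)) pulls weights toward 1, while the L2 term, balanced
    by alpha, pushes weights back toward 0; together with the detection loss
    this drives the weights toward a U-shaped (near-0 or near-1) distribution.
    """
    mse_term = ((mask_weight - 1.0) ** 2).mean()             # Eq. (3) with I = all-ones
    l2_term = torch.linalg.vector_norm(mask_weight, ord=2)   # ||M||_2
    return mse_term + alpha * l2_term
```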

3.2.3. Total Loss

The total loss consists of three parts: $L_{agg}$ measures the aggregation of the attack regions, $L_{sparse}$ measures the area of the attack regions, and $L_{adv}$ measures the adversarial attack effectiveness:
$L_{agg}$ loss: To aggregate the adversarial camouflage regions, enabling the MARS algorithm to generate more complete regions and enhance the local attack performance, $L_{agg}$ is included in the total loss, as calculated by Equation (1).
$L_{sparse}$ loss: To minimize the adversarial camouflage regions, enabling the MARS algorithm to produce smaller regions with more natural visual effects, $L_{sparse}$ is included in the total loss, as calculated by Equation (4).
$L_{adv}$ loss: To ensure the decisiveness of the adversarial camouflage regions, $L_{adv}$ is included in the total loss; its primary function is to reduce the accuracy of the detection model after the adversarial attacks to achieve the desired adversarial effectiveness. The $L_{adv}$ loss consists of three parts: the bounding box regression loss $L_{bbox}$ measures the difference between the detection box and the original ground-truth box; the class loss $L_{cls}$ measures the difference between the predicted classification and the true class; and the object loss $L_{obj}$ reflects the object confidence. The Binary Cross-Entropy (BCE) loss is used here, and the confidence label $L$ and predicted confidence $P$ are used to compute the overall confidence loss. To achieve efficient target attacks, these losses are scaled down by specific hyperparameter ratios during the training process. Additionally, to better control reproducibility in the physical world, we incorporate a smoothness constraint during the texture modification process to regulate color transitions. The smoothness loss $L_{smooth}$ is calculated via Equation (8), and $L_{adv}$ is calculated via Equation (9):
$IoU = (A \cap B) / (A \cup B)$  (5)
$L_{bbox} = \sum_{scale=1}^{3} IoU_{scale}$  (6)
$L_{cls} = \sum_{cls=1}^{n} BCE(p_{cls}, \hat{p}_{cls})$  (7)
$L_{smooth} = \sum_{i,j} \big[ (x_{i,j} - x_{i+1,j})^2 + (x_{i,j} - x_{i,j+1})^2 \big]$  (8)
$L_{adv} = L_{bbox} + L_{obj} + L_{cls} + L_{smooth}$  (9)
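A compact sketch of how these terms could be assembled is shown below, assuming the detector already returns per-scale IoUs, class probabilities, and objectness confidences; the hyperparameter scaling mentioned above is omitted, and the function and argument names are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(texture: torch.Tensor) -> torch.Tensor:
    """Eq. (8): squared differences of horizontally/vertically adjacent texels; texture is (C, H, W)."""
    dx = (texture[:, :, :-1] - texture[:, :, 1:]) ** 2
    dy = (texture[:, :-1, :] - texture[:, 1:, :]) ** 2
    return dx.sum() + dy.sum()

def adversarial_loss(scale_ious, pred_cls, cls_label, pred_conf, conf_label, texture):
    """Eq. (9): L_adv = L_bbox + L_obj + L_cls + L_smooth (a sketch)."""
    l_bbox = sum(iou.sum() for iou in scale_ious)          # Eqs. (5)-(6), summed over the 3 YOLO scales
    l_cls = F.binary_cross_entropy(pred_cls, cls_label)    # Eq. (7)
    l_obj = F.binary_cross_entropy(pred_conf, conf_label)  # objectness confidence (BCE)
    return l_bbox + l_obj + l_cls + smoothness_loss(texture)
```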
In summary, the total loss function is designed as follows: Hyperparameters α and β control the gradual convergence of the perturbation patches to the desired number during the training process. Simultaneously, parameter γ must occupy a certain proportion to ensure the aggressiveness of the results. An ablation study on the hyperparameter selection is subsequently conducted.
$L = \alpha \cdot L_{agg} + \beta \cdot L_{sparse} + \gamma \cdot L_{adv}$  (10)

3.3. Three-Dimensional–Two-Dimensional Transformation

This section consists of two parts: (1) forward propagation, where 3D objects are transformed into 2D images, and (2) backward propagation, where the loss and gradients from the 2D images are returned to the 3D objects.
Forward propagation. The process of generating images from the 3D world is known as rendering, which serves as the boundary between the 3D world and 2D images and is crucial in computer vision. In this method, we use the Neural Renderer, which takes polygon meshes as its 3D format and represents a 3D shape with a small set of parameters. This step enables the transformation of a 3D target into many 2D images that contain environmental information, which can be input into the detector for detection and gradient backpropagation.
The rendering pipeline converts vertices $\{V_O^i\}$ in object space into vertices $\{V_S^i\}$ in screen space. The rasterization process samples the vertices and faces to generate images and renders each face with its own $s_t \times s_t \times s_t$ texture map. Barycentric coordinates determine the corresponding texture-space coordinates for a position $p$ on the triangle $\{v_1, v_2, v_3\}$, and bilinear interpolation samples from the texture image to generate the final image.
$T = ap \cdot mask\_weight + ot \cdot (1 - mask\_weight)$  (11)
This process specifically involves converting the 3D model $(M, T)$ into 2D images $O$ that contain adversarial camouflage information. Here, $M$ is the model mesh, and $T$ is the model texture, which includes both the texture and the mask. The texture is calculated via Equation (11), where $ap$ denotes the adversarial perturbation and $ot$ signifies the original texture. The environmental image $B$ is obtained by segmenting the original image to exclude the target information. The final image $I$ is created by compositing the adversarial image $O$ with the environmental image $B$.
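For reference, the per-face blending of Equation (11) can be written directly in PyTorch; the (F, st, st, st, 3) texture layout is the one used by the Neural Mesh Renderer, and the function name is ours.

```python
import torch

def blend_texture(ap: torch.Tensor, ot: torch.Tensor,
                  mask_weight: torch.Tensor) -> torch.Tensor:
    """Eq. (11): T = ap * mask_weight + ot * (1 - mask_weight), applied per face.

    ap, ot:      (F, st, st, st, 3) adversarial and original face texture maps.
    mask_weight: (F,) per-face weights, broadcast over each face's texture map.
    """
    w = mask_weight.clamp(0.0, 1.0).view(-1, 1, 1, 1, 1)
    return ap * w + ot * (1.0 - w)
```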
Backward propagation. The Neural Renderer treats rasterization as a process through which gradients can be propagated backward, establishing a deep relationship between the 3D target and the 2D image that enables optimization. A neural rendering network is trained here, introducing a suitable approximate gradient for neural network rendering and propagating the gradients to the texture, lighting, camera, and object shape.
To ensure effective gradient backpropagation, we employ linear interpolation to smooth out abrupt changes in color, which helps maintain meaningful gradient information. The U-shaped weight distribution facilitates this by pushing mask weights to their extremes, ensuring that gradients are either fully propagated or completely suppressed, avoiding intermediate values that could dilute the effectiveness of the adversarial perturbations.
Gradient backpropagation is challenging because sudden color changes can lead to zero gradients and prevent backpropagation. Therefore, linear interpolation is used to replace gradual changes among the pixels, as shown in Equation (12). The gradient at $x_0$ ($x_0 \in [a, b]$) is provided by Equation (13), which distinguishes between two cases: when the target pixel $P_j$ is inside the face, the partial derivative is defined as zero, and the internal face color is used for the forward propagation to avoid color leakage. If $P_j$ is outside the face, the linear interpolation alters the color coefficient, so the derivatives on the left and right sides of $x_0$ are calculated, and their sum yields the gradient at $x_0$. The specific formulas are provided in Equations (14)–(16).
$\frac{\partial I_j(x_i)}{\partial x_i}$ becomes $\frac{\delta_j^I}{\delta_i^x}$  (12)
$\left. \frac{\partial I_j(x_i)}{\partial x_i} \right|_{x_i = x_0} = \begin{cases} \frac{\delta_j^I}{\delta_i^x}, & \delta_j^P \delta_j^I < 0 \\ 0, & \delta_j^P \delta_j^I \geq 0 \end{cases}$  (13)
$\left. \frac{\partial I_j(x_i)}{\partial x_i} \right|_{x_i = x_0} = \left. \frac{\partial I_j(x_i)}{\partial x_i} \right|^{a}_{x_i = x_0} + \left. \frac{\partial I_j(x_i)}{\partial x_i} \right|^{b}_{x_i = x_0}$  (14)
$\left. \frac{\partial I_j(x_i)}{\partial x_i} \right|^{a}_{x_i = x_0} = \begin{cases} \frac{\delta_j^{I_a}}{\delta_x^{a}}, & \delta_j^P \delta_j^{a} < 0 \\ 0, & \delta_j^P \delta_j^{a} \geq 0 \end{cases}$  (15)
$\left. \frac{\partial I_j(x_i)}{\partial x_i} \right|^{b}_{x_i = x_0} = \begin{cases} \frac{\delta_j^{I_b}}{\delta_x^{b}}, & \delta_j^P \delta_j^{b} < 0 \\ 0, & \delta_j^P \delta_j^{b} \geq 0 \end{cases}$  (16)
During the backpropagation, if multiple faces overlap, the intersection points are checked to determine whether they are rendered. If they do not overlap with the face, the gradient is not calculated.

3.4. Pseudocode for MARS

This section describes the Maximum Aggregated Region Sparseness (MARS) strategy-driven local 3D attack framework, which explains the optimization process of a local adversarial attack with the MARS strategy on 3D objects. To balance the visual stealth and attack effectiveness, we design the MARS algorithm to identify critical decision regions on the target. Using the characteristics of these critical regions and combining them with the network detection results, we design an effective loss function. The gradients of this loss function are backpropagated to guide the parameter updates on the surface of the target object, which models the identification of critical decision regions as an optimization problem. Finally, texture modifications are performed in these regions to achieve local adversarial camouflage with high attack effectiveness. Algorithm 1 outlines the local adversarial attack optimization algorithm based on local decision regions, where C represents the texture optimization method in the regions.
Algorithm 1 MARS
Input: 3D model $(M, T)$, camera parameters $\theta_c$, ground-truth label $y_{gt}$
Output: aggregated-region mask $M_{adv}^{*}$
1: Initialize $M_{adv}^{*}$ with {0} and the adversarial perturbation $A_{adv}$ with the background color
2: for the max iteration do
3:    for the max batch size do
4:       update $mask$
5:       $I = R(M, T, \theta_c)$
6:       $T_{adv}^{*} = T \cdot (1 - M_{adv}^{*}) + A_{adv} \cdot M_{adv}^{*}$
7:       $I_{adv} = R(M, T_{adv}^{*}, \theta_c)$
8:       $y = D(I_{adv})$
9:       calculate $L$ by Equation (10)
10:      update $M_{adv}^{*}$ with gradient backpropagation
11:   end for
12: end for
13: $M_{adv}^{*} = M(M, T)$
14: $T_{adv}^{*} = C(M, T, M_{adv}^{*})$
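Step for step, Algorithm 1 maps onto a gradient-based optimization loop. The sketch below is our condensation of it, written under the assumption of PyTorch-style render and detector callables and the loss sketches given earlier; adv_loss_from_preds is a hypothetical wrapper around Equation (9), and none of this is the authors' implementation.

```python
import torch

def mars_optimize(render, detector, mesh, texture, adv_pattern, mesh_adjacency,
                  cam_params, loss_weights=(1.0, 1.0, 1.0), iters=100, lr=0.02):
    """Condensed sketch of Algorithm 1: learn a per-face region mask M_adv*."""
    alpha, beta, gamma = loss_weights
    # one logit per face; a sigmoid keeps the mask weights in [0, 1]
    mask_logits = torch.zeros(texture.shape[0], requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)

    for _ in range(iters):                      # "for the max iteration do"
        for cam in cam_params:                  # "for the max batch size do"
            mask = torch.sigmoid(mask_logits)
            t_adv = blend_texture(adv_pattern, texture, mask)     # Eq. (11) / step 6
            image = render(mesh, t_adv, cam)                      # step 7
            preds = detector(image)                               # step 8
            # adv_loss_from_preds: hypothetical wrapper around the Eq. (9) terms
            loss = (alpha * aggregation_loss(mask, mesh_adjacency)
                    + beta * sparseness_loss(mask)
                    + gamma * adv_loss_from_preds(preds, t_adv))  # Eq. (10) / step 9
            opt.zero_grad()
            loss.backward()                                       # step 10
            opt.step()
    # threshold the learned weights to obtain the final aggregated-region mask
    return (torch.sigmoid(mask_logits) > 0.5).float()
```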

4. Experiments and Results

In this section, we illustrate the details of our experimental setup and present the experimental results.

4.1. Experimental Setup

Training Settings: All models were trained for 20 epochs using the Adam optimizer with an initial learning rate of 0.02. The learning rate was scheduled to decay by a factor of 0.1 every five epochs. A batch size of 2 was used for training, and weight decay was set to 0.0005 to prevent overfitting. These hyperparameters were selected based on preliminary experiments to balance training efficiency and model performance.
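For reproducibility, these settings correspond to a standard PyTorch optimizer/scheduler configuration; the sketch below uses a placeholder module in place of the actual parameters being optimized and is illustrative only.

```python
import torch
from torch import nn

# Placeholder module standing in for the parameters being optimized.
model = nn.Linear(10, 10)

optimizer = torch.optim.Adam(model.parameters(), lr=0.02, weight_decay=0.0005)
# decay the learning rate by a factor of 0.1 every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

num_epochs, batch_size = 20, 2
for epoch in range(num_epochs):
    # ... one training pass over the data (batch size 2) goes here ...
    scheduler.step()
```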
Dataset: In the field of autonomous vehicle detection, there is no comprehensive, open training dataset suited to 3D adversarial attacks, so we chose a rendering-based generated dataset that offers rich scenes and differentiable rendering. The Carla dataset is the standard dataset in this field, and to remain consistent with the domain, we use the Carla simulator [75], an open simulation platform that models a vehicle driving through a city, to generate simulated images from different perspectives, distances, and environments along the driving route. In total, 15,000 images were generated and exported from the Carla simulator following the Carla dataset standard, as shown in Figure 5.
Metric: To measure the effectiveness of the adversarial attack, we use two metrics: average precision (AP) and a custom metric called attack efficiency (AE). In this experiment, the AP is defined as the average precision for the class “car,” which is calculated using Equation (17). The attack efficiency (AE) is calculated using Equation (18):
$AP = \int_{0}^{1} p(r)\, dr$  (17)
$AE = \delta_{AP} / p_{face}$  (18)
where $p(r)$ is the smoothed precision-recall curve, $\delta_{AP}$ is the difference in average precision before and after the attack, and $p_{face}$ is the proportion of faces modified during the attack. A reduction in AP indicates a decrease in the model's ability to correctly detect and classify objects, thereby demonstrating the effectiveness of the adversarial attack. However, it is crucial to balance this reduction to avoid overly degrading the model's performance, which could render the attack impractical or easily noticeable.
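Read this way, Equation (18) normalizes the AP drop by the fraction of faces modified; the small sketch below follows that reading (an assumption on our part) and uses illustrative numbers only.

```python
def attack_efficiency(ap_clean: float, ap_attacked: float,
                      faces_modified: int, faces_total: int) -> float:
    """Eq. (18): AE = (AP drop) / (proportion of faces modified)."""
    delta_ap = ap_clean - ap_attacked
    p_face = faces_modified / faces_total
    return delta_ap / p_face

# Illustrative numbers only: an AP drop of 0.62 from modifying 20% of the
# faces yields AE = 0.62 / 0.2 = 3.1.
print(attack_efficiency(ap_clean=0.98, ap_attacked=0.36,
                        faces_modified=200, faces_total=1000))
```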
Experimental Schema: First, we train six detectors from the YOLOv3 and YOLOv5 series on the Carla dataset. We use three selection strategies (manual expert selection, random selection, and MARS) to identify different local regions. Full-body attacks are used as the baseline control group to compare and analyze the attack effects on these regions. The training was conducted under identical conditions for all models to ensure a fair comparison.
The experiments are divided into three parts:
1
Attack efficiency: the main objective is to demonstrate the superiority of MARS;
2
Transferability: this part assesses the transferability of the model by comparing the impacts of different texture optimization methods on the attack performance;
3
Parameter sensitivity analysis: this section presents an investigation of the significance and sensitivity of core parameters through variations and ablation experiments.

4.2. Local Adversarial Attacks with Different Region Selection Strategies

We use different selection strategies to identify various local regions and conduct local attacks using both fixed textures and optimized textures. Then, we compare their performance across different detection networks for the same dataset and training network conditions. YOLOv3 is selected as the training network, whereas YOLOv3, YOLOv5s, YOLOv5x, YOLOv5m, YOLOv5n, and YOLOv5l are selected as the detection networks. All detection networks are trained with identical settings using the Carla dataset. Figure 6 and Figure 7 show the experimental results.
The first row, which is labeled “fixed texture,” uses a fixed camouflage for local attacks with different region selection strategies. The first column contains sample images of the original unmodified targets. We aim to mitigate the impact of the perturbation on the detector to highlight the effects of different region selection strategies on the attack effectiveness. The second row, which is labeled “optimized texture,” employs optimized patches for local attacks with different region selection strategies. The first column contains sample images of full-body attacks, which serves as the baseline. We aim to demonstrate the superiority of local attacks using the MARS strategy. The comparison strategies for region selection are as follows: The second and third columns show fixed regions that are manually selected based on expert guidance and highlight the edge and center regions, respectively. The fourth column shows the randomly selected regions. The fifth column shows critical decision regions that are selected by MARS. We evaluate the AP for these attack scenarios, and Table 1 shows the results.
Performance Variations Between YOLOv3 and YOLOv5: The differences in performance between YOLOv3 and YOLOv5 can be attributed to variations in their architectural complexity and depth. YOLOv5 models are generally deeper and incorporate more advanced features such as enhanced backbone networks and better anchor box strategies, which can influence their susceptibility to adversarial attacks. Specifically, YOLOv5l, being the largest variant, exhibits lower attack effectiveness due to its increased robustness and capacity to generalize from complex patterns, whereas smaller variants like YOLOv5s are more vulnerable due to their reduced complexity. These architectural differences explain why the attack performance varies across different YOLOv5 variants compared to YOLOv3.
Table 1 reveals significant differences between the Fixed(center) and Fixed(edge) strategies. Edge regions have lower visibility from different viewpoints than central regions, which leads to varied contributions during attacks. The AP after the Fixed(center) attacks is consistently lower than that after the Fixed(edge) attacks, which indicates that object edges are not equivalent to decision boundaries in neural networks. Random regions show inconsistent performance. With a fixed texture, the AP decreases to 0.967, which is better than the values of the fixed regions (0.980 and 0.978). However, with the optimized texture, the AP only decreases to 0.726, which is less effective than the values of the fixed regions (0.719 and 0.184). Thus, although random regions cover a broader area, their fragmented nature hampers the effective optimization and deception of neural networks. Additionally, YOLOv3 and YOLOv5 models respond differently to various attack strategies due to their distinct network architectures. YOLOv3 tends to be more sensitive to centralized attacks, whereas YOLOv5 models, especially larger variants, exhibit varied sensitivity based on their depth and feature extraction capabilities. Our proposed MARS strategy reduces the AP to 0.618–0.188 under optimized conditions, surpassing all other strategies and closely approximating the full-body attack results.
Compared with the baseline, the attack efficiency (AE) for both fixed texture and optimized texture attacks is shown below, and the superior performance of the MARS strategy is highlighted.
Table 2 and Table 3 show that the AE of our proposed MARS method consistently outperforms the baseline and other local region selection strategies. The variations in AE across different YOLOv5 variants indicate that deeper and more complex networks like YOLOv5l are more resilient to adversarial attacks, requiring more concentrated and effective perturbations to achieve significant AE. In YOLOv3, MARS achieved an AE of 2.615, which is more than double the baseline value (0.911). In the YOLOv5 series, MARS achieved at least 0.608, which significantly outperformed the control group (0.532). The average AE for MARS was 1.7235, i.e., a 0.986 (134%) improvement over the baseline. These experiments demonstrate that, compared with other methods, our method effectively balances the coverage and region completeness and significantly improves stability and transferability. This result confirms the feasibility of using local adversarial attacks with the Maximum Aggregated Region Sparseness (MARS) strategy on 3D objects to attack detectors.
A significant reduction in AP signifies that the adversarial attack effectively diminishes the model’s capability to accurately detect and classify objects. In practical terms, this could lead to scenarios where critical objects, such as vehicles in autonomous driving systems, go undetected or are misclassified, potentially causing safety hazards. However, the extent of AP reduction must be carefully managed to avoid rendering the system non-functional, which could be impractical or easily noticed by human operators.
However, due to differences in network depth and width, the robustness to adversarial examples varies. In particular, the YOLOv5l network consistently shows lower attack effectiveness. This variation highlights the need for tailored adversarial strategies that consider the architectural nuances of different detection models. Addressing the network complexity and enhancing adversarial robustness will be a focus of future work.

4.3. Attack Transferability

This section of the experiment has two main objectives: to determine whether MARS can be combined with different texture modification methods for attacks and to assess the attack effectiveness of models that are trained on YOLO detectors and applied to other detectors.
We select the full region, Fixed(center) region, and MARS region, which perform well and are representative of previous experiments, as variables. Using FCA [13] and DAS [60] for texture optimization, we compare the resulting adversarial outcomes across different detection networks. To ensure the independence of the results, the detection networks are selected outside the YOLO series and include Mask R-CNN, Cascade R-CNN, Faster R-CNN, SSD, and RetinaNet.
As shown in Table 4, FCA and DAS exhibit varying performance across different region selection strategies. Full attacks consistently demonstrate stable results, where FCA and DAS achieve average AP decreases of 0.631 and 0.627, respectively. For the Fixed(center) region, Fixed(center) + FCA achieves an average AP decrease of 0.356, whereas Fixed(center) + DAS achieves an average AP decrease of 0.525. For the MARS region, MARS + FCA achieves an average AP decrease of 0.488, and MARS + DAS achieves an average AP decrease of 0.662. The MARS attack consistently outperforms other local attack methods, particularly with DAS, where it even outperforms the full attack. Thus, the critical decision regions identified by MARS align well with the decision boundaries of the model and demonstrate strong transferability with texture modification methods. Furthermore, the enhanced performance of MARS across different detection networks suggests that the aggregation and sparsity constraints effectively target universally critical regions, making the adversarial perturbations more versatile and robust against various model architectures. This adaptability is crucial for real-world applications where multiple detection systems may be in use.
Considering the attack effectiveness across different detection networks, the MARS attack outperforms other local attack methods on all networks. Specifically, MARS + DAS exceeds the performance of other local region selection strategies and surpasses the full attack in all networks except Cascade R-CNN. This superior performance is likely due to the MARS strategy’s ability to focus perturbations on regions that are consistently influential across different models, enhancing both the attack’s effectiveness and its transferability.

4.4. Attack Performance for Different Factors

In this section, we adjust various parameter coefficients to explore the significance of each loss parameter in this paper.
The training model is a YOLOv3 network trained for one epoch on the Carla dataset. Since we must only compare the effects of different coefficients, the selected detectors are also the YOLOv3 and YOLOv5s networks trained for one epoch on the Carla dataset.
First, we conduct experiments with different combinations of loss parameters. The adversarial loss, which ensures the fundamental effectiveness of the attack, is not adjusted. During training, the loss parameters are set as follows: $loss_{total}$ (all), $L_{adv}$ (single detection loss), $\alpha \cdot L_{agg} + \gamma \cdot L_{adv}$ (with aggregation regularization), and $\beta \cdot L_{sparse} + \gamma \cdot L_{adv}$ (with minimization regularization).
A comparison of Figure 8 shows that only pursuing aggregation during training tends to result in a uniform distribution of mask weights. The resulting regions lack constraints on the mask area and mask weight range and consequently do not possess decisive characteristics. Table 5 shows that with only the adversarial loss ($L_{adv}$) set, the AP decreases to 0.686 and 0.704. When only aggregation regularization ($L_{agg}$) is used, the AP only decreases to 0.77 and 0.655, which indicates poor attack performance. Conversely, when only sparseness regularization ($L_{sparse}$) is used, the AP decreases to 0.213 and 0.482, which significantly enhances the attack performance. This demonstrates that sparsity regularization is crucial for concentrating adversarial perturbations in critical regions, thereby increasing the attack's effectiveness while maintaining stealthiness. The combined loss setting of $L_{agg}$ and $L_{sparse}$, which is ultimately adopted in this paper, achieves the best overall performance, identifies universal decision regions, and successfully executes attacks.
In the second part of the experiment, we focus on adjusting the pre-parameter for the aggregation loss. As shown in Table 6, adjustments to the $L_{agg}$ coefficient reveal that increasing it gradually improves the attack performance; however, after a certain balance point is reached, the efficiency begins to decline. The mid-range values exhibit relatively stable results across all networks. This indicates that there is an optimal range for the aggregation coefficient in which the aggregation of adversarial regions is maximized without causing over-concentration that could dilute the attack's effectiveness. The analysis indicates that the $L_{agg}$ coefficient controls the degree of aggregation in local regions. Increasing the coefficient yields more complete regions, which provides more possibilities for the optimization process. This behavior enables the generation of various continuous patterns that can deceive deep neural networks, which significantly impacts the attack effectiveness. The $L_{adv}$ coefficient controls the network attack effect; to better enhance the attack efficiency, it is essential to balance the loss coefficients. This experiment briefly explores the impact of the parameter coefficients on the experimental results. In future research, we will more precisely define the significance of the $L_{agg}$ coefficient and delve deeper into the relationships among the various coefficients.
In this part of the experiment, we keep other coefficients constant while adjusting the pre-parameter for the sparsity coefficient to modify the number of masks generated during optimization. This process enables us to compare the impact of the mask size on the attack efficiency. Table 7 shows the AP results. Since the variable here is the number of masks, we switch the metric to AE for easier comparison in Table 8.
As shown in Table 7, regarding the attack effectiveness, the number of masks has minimal effect when the count exceeds 2000. However, significant precision changes occur when the mask count fluctuates within the 0–2000 range. This result suggests that for this physical target, the optimal number of core critical regions detected by deep neural networks is less than 2000. This finding supports the main premise of this study: traditional adversarial attacks often optimize non-critical regions, which wastes computational resources. Our proposed method effectively identifies a sufficient number of local critical regions, reduces computational costs, enhances the visual effects, and achieves excellent attack performance across different detection networks under various mask counts.
As shown in Table 8, reducing the number of masks leads to a noticeable decrease in precision impact but a slight increase in attack efficiency. We observe that reducing the number of masks consistently increases the attack efficiency, which further emphasizes the importance of studying critical decision regions. When the mask count decreases during training, the computational speed improves, which is negligible in small-scale training but highly significant in large-scale large-model training. The aim of our study was to ensure rapid training and excellent visual effects while achieving significant attack effectiveness. The improvement in attack efficiency positively reflects this goal. Overall, our method consistently achieves robust adversarial results across different networks under most mask count settings, which demonstrates its broad applicability.
Ethical Considerations: The development and deployment of adversarial attacks pose significant ethical concerns, particularly regarding their potential misuse in critical systems such as autonomous driving, surveillance, and security infrastructures. These attacks can undermine the reliability and safety of systems that people depend on daily, leading to severe societal and economic consequences. To mitigate these risks, it is essential to implement robust defense mechanisms, promote responsible research practices, and establish regulatory frameworks that govern the use of adversarial technologies. Additionally, raising awareness about the vulnerabilities of machine learning models can encourage the development of more resilient systems and ethical guidelines for deploying such technologies.
Societal Risks: Adversarial attacks against critical systems such as autonomous vehicles, security surveillance, and infrastructure monitoring pose significant societal risks. These attacks can lead to accidents, breaches of privacy, and disruptions of essential services, potentially causing widespread harm. The ability to deceive detection models undermines trust in automated systems and can have far-reaching implications for public safety and security. To address these risks, it is imperative to develop robust defense mechanisms, enforce strict ethical guidelines, and promote collaboration between researchers, policymakers, and industry stakeholders to ensure that advancements in adversarial attacks do not compromise societal well-being.

5. Conclusions

In this study, we propose a novel method for identifying critical decision regions on the target surface and successfully achieving a balance between the visual effect and attack effectiveness in localized adversarial attacks. This approach demonstrates superior effectiveness compared to other region selection strategies and highlights its potential for targeted adversarial attacks. Moreover, when our region search strategy is combined with texture optimization, it yields outstanding attack performance and transferability. By focusing on smaller regions, this method reduces the introduction of non-transferable noise features and minimizes the likelihood of local optima in the attack model. Notably, in certain scenarios, our experimental results surpassed the effectiveness of the full-coverage optimization attack. Despite these strengths, a limitation of our method is the challenge of maintaining coherence between region optimization and subsequent texture optimization. Independently conducting these steps can introduce irrelevant noise, which may detract from the overall attack effectiveness. Future research will address this issue by investigating dual-parameter optimization techniques to simultaneously refine adversarial regions and textures. This integrated approach is expected to enhance the coherence and efficacy of adversarial attacks and advance the field of targeted adversarial strategies. Future avenues for improvement involve developing more sophisticated aggregation techniques, integrating MARS with other adversarial methods to create more potent attack strategies, and refining the approach to maintain effectiveness against increasingly robust detection systems. Addressing these challenges will further advance the capabilities of adversarial attacks, highlighting the necessity for ongoing research into model robustness and the development of more resilient detection frameworks.

Author Contributions

Conceptualization, X.L. and L.Z. (Ling Zhao); methodology, X.L.; software, B.L.; validation, X.L., L.Z. (Lili Zhu), B.L., H.C. and J.C.; formal analysis, X.L.; investigation, X.L.; resources, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, H.L., J.P. and L.Z. (Ling Zhao); visualization, X.L.; supervision, H.L., J.P. and L.Z. (Ling Zhao); project administration, H.L. and J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42301381.

Institutional Review Board Statement

This study did not involve human or animal subjects, and thus, no ethical approval was required.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset associated with this research is available online at https://drive.google.com/drive/folders/1vspvRxnZ3shOV4kM5ELcO9-xztapBThS?usp=sharing (accessed on 20 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. A local adversarial attack with a Maximum Aggregated Region Sparseness strategy for 3D objects.
Figure 2. Overview of the Maximum Aggregated Region Sparseness strategy.
Figure 3. Aggregation matrix: influence factor calculation based on face–core distance.
Figure 4. Adversarial examples without regularization for sparsity.
Figure 5. Overview of the dataset, which contains a variety of simulated images from the CARLA simulator, captured from different perspectives and distances and under different environmental conditions.
Figure 6. Examples of different region selection strategies.
Figure 7. Adversarial examples generated from different angles based on MARS.
Figure 8. Mask weight distribution: (top) loss = L_agg + L_adv; (bottom) loss = L_agg + L_sparse + L_adv.
Table 1. AP of different region selection strategies.
| Detector | Clean | Fixed (Edge) | Fixed (Center) | Fixed (Random) | Fixed (Ours) | Optimized (Full) | Optimized (Edge) | Optimized (Center) | Optimized (Random) | Optimized (Ours) |
| yolov3 | 0.983 | 0.98 | 0.978 | 0.967 | 0.899 | 0.071 | 0.719 | 0.184 | 0.726 | 0.174 |
| yolov5m | 0.991 | 0.99 | 0.99 | 0.989 | 0.931 | 0.123 | 0.935 | 0.745 | 0.912 | 0.373 |
| yolov5n | 0.988 | 0.985 | 0.985 | 0.969 | 0.888 | 0.306 | 0.956 | 0.787 | 0.887 | 0.523 |
| yolov5l | 0.953 | 0.952 | 0.952 | 0.944 | 0.831 | 0.421 | 0.907 | 0.815 | 0.879 | 0.765 |
| yolov5s | 0.964 | 0.962 | 0.956 | 0.955 | 0.902 | 0.289 | 0.919 | 0.567 | 0.805 | 0.337 |
| yolov5x | 0.994 | 0.994 | 0.993 | 0.972 | 0.815 | 0.235 | 0.842 | 0.639 | 0.853 | 0.502 |
Table 2. AE (fixed texture) of different region selection strategies.
| Detector | Fixed (Edge) | Fixed (Center) | Random | Ours |
| mask_num | 2000 | 2000 | 2000 | 2000 |
| yolov3 | 0.01 | 0.016 | 0.052 | 0.272 |
| yolov5m | 0.003 | 0.003 | 0.006 | 0.194 |
| yolov5n | 0.01 | 0.01 | 0.061 | 0.323 |
| yolov5l | 0.003 | 0.003 | 0.029 | 0.394 |
| yolov5s | 0.006 | 0.026 | 0.029 | 0.2 |
| yolov5x | 0 | 0.003 | 0.071 | 0.579 |
Table 3. AE (optimized texture) of different region selection strategies.
| Detector | Full | Fixed (Edge) | Fixed (Center) | Random |
| mask_num | 6466 | 2000 | 2000 | 2000 |
| yolov3 | 0.911 | 0.842 | 0.639 | 0.75 |
| yolov5m | 0.868 | 0.935 | 0.815 | 0.842 |
| yolov5n | 0.682 | 0.956 | 0.787 | 0.639 |
| yolov5l | 0.532 | 0.907 | 0.815 | 0.765 |
| yolov5s | 0.675 | 0.919 | 0.567 | 0.514 |
| yolov5x | 0.759 | 0.842 | 0.639 | 0.502 |
Table 4. AP (optimized texture) with different texture optimizers.
| Method | Mask Pattern | Mask R-CNN | Cascade R-CNN | Faster R-CNN | SSD | RetinaNet | Average AE |
| clean | | 0.764 | 0.723 | 0.716 | 0.705 | 0.751 | 0.732 |
| FCA | Full | 0.131 | 0.052 | 0.069 | 0.199 | 0.055 | 0.101 (↓0.631) |
| FCA | Fixed (center) | 0.444 | 0.357 | 0.34 | 0.374 | 0.366 | 0.376 (↓0.356) |
| FCA | MARS | 0.314 | 0.209 | 0.188 | 0.237 | 0.27 | 0.244 (↓0.488) |
| DAS | Full | 0.178 | 0.083 | 0.154 | 0.078 | 0.032 | 0.105 (↓0.627) |
| DAS | Fixed (center) | 0.259 | 0.202 | 0.19 | 0.199 | 0.186 | 0.207 (↓0.525) |
| DAS | MARS | 0.112 | 0.066 | 0.053 | 0.069 | 0.049 | 0.070 (↓0.662) |
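
The arrow values in the final column of Table 4 denote the drop from the clean average AP of 0.732. As a worked check for MARS combined with FCA:

```latex
\mathrm{AP}_{\text{clean}} - \mathrm{AP}_{\text{FCA+MARS}} = 0.732 - 0.244 = 0.488
```

The corresponding value for DAS combined with MARS is 0.732 - 0.070 = 0.662.
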
Table 5. AP (different factor combinations).
| Loss | agg+sparse+adv | agg+adv | sparse+adv | adv |
| yolov3 | 0.174 | 0.77 | 0.213 | 0.686 |
| yolov5 | 0.337 | 0.655 | 0.482 | 0.704 |
Table 6. AP (different factors: agg_loss).
| agg_loss | 0.5 | 0.75 | 1 | 1.25 | 1.5 | 1.75 | 2 |
| yolov3 | 0.2 | 0.167 | 0.174 | 0.173 | 0.174 | 0.183 | 0.172 |
| yolov5 | 0.397 | 0.365 | 0.337 | 0.334 | 0.328 | 0.43 | 0.33 |
Table 7. AP (different factors: number of masks).
| Mask Numbers | 0 | 6466 | 5000 | 4000 | 3000 | 2000 | 1500 | 1000 | 500 |
| yolov3 | 0.983 | 0.0719 | 0.129 | 0.142 | 0.148 | 0.174 | 0.216 | 0.299 | 0.597 |
| yolov5 | 0.964 | 0.289 | 0.374 | 0.39 | 0.426 | 0.337 | 0.495 | 0.527 | 0.696 |
Table 8. AE (different factors: number of masks).
| Mask Numbers | 6466 | 5000 | 4000 | 3000 | 2000 | 1500 | 1000 | 500 |
| yolov3 | 0.911 | 1.104 | 1.359 | 1.8 | 2.615 | 3.306 | 4.423 | 4.992 |
| yolov5 | 0.675 | 0.763 | 0.928 | 1.16 | 2.027 | 2.022 | 2.826 | 3.466 |
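
As a consistency check across Tables 7 and 8 (an observation from the tabulated numbers, not the paper's formal definition of AE), the reported AE values match the AP drop scaled by the ratio of the full face count (6466) to the number of masked faces, where N_mask below is a symbol introduced here for illustration. For example, for YOLOv3 with 2000 masks:

```latex
\mathrm{AE} \approx \left(\mathrm{AP}_{\text{clean}} - \mathrm{AP}_{\text{adv}}\right)\cdot\frac{6466}{N_{\text{mask}}}
            = (0.983 - 0.174)\cdot\frac{6466}{2000} \approx 2.615
```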