We implemented our SLIC using the CompressAI [56] release of the neural codecs [24,25,26], denoted as Balle2018, Minnen2018, and Cheng2020. The MSE-optimized pre-trained models from CompressAI were used for fine-tuning. For training, we randomly selected 90% of the images from the COCO dataset [57] as the training set and the remaining 10% as the validation set. We employed the PyTorch 2.0.0 built-in Adam optimizer to fine-tune each pre-trained LIC model for 100 epochs. An early stopping criterion halted training once the learning rate decayed below a preset threshold. Training images were randomly cropped into patches with a batch size of 12 on an NVIDIA GeForce RTX 4090 GPU. We set the weighting factor in the adversarial loss function from Equation (6), and in the overall rate–distortion adversarial loss function from Equation (5) we follow the same λ value setting as CompressAI while tweaking the remaining hyper-parameter, as shown in Table 1. We report the experimental numbers using a high-bitrate setting at quality scale 8; however, the destructive re-compression effect can be observed at all quality settings.
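As a concrete sketch of the schedule described above, the learning-rate-based early stopping can be emulated as follows. The initial learning rate, decay factor, patience, and stopping threshold are assumed values for illustration, since the exact settings are elided here:

```python
# Sketch of the fine-tuning schedule: decay the learning rate on a validation-loss
# plateau and stop early once it falls below a threshold. All numeric settings
# below are assumptions for illustration, not the paper's exact values.
LR_INIT = 1e-4        # assumed initial learning rate
LR_STOP = 1e-6        # assumed early-stopping threshold
DECAY, PATIENCE = 0.5, 5

def train_schedule(val_losses):
    """Return the epoch at which training stops (early stop or end of schedule)."""
    lr, best, bad = LR_INIT, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:               # validation improved: reset patience
            best, bad = loss, 0
        else:                         # plateau: count toward a decay step
            bad += 1
            if bad > PATIENCE:
                lr, bad = lr * DECAY, 0
        if lr < LR_STOP:              # early-stopping criterion
            return epoch
    return len(val_losses)
```

With a flat validation loss, the learning rate halves every PATIENCE + 1 epochs, so training stops well before the 100-epoch budget.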
We evaluated our SLIC on the Kodak [58], FFHQ [59], and DIV2K [60] datasets. We used the original-resolution images from Kodak (768×512) and FFHQ (1024×1024). Due to GPU memory constraints, we resized the DIV2K images to a smaller resolution and padded each side to a multiple of 32 using replication mode. To evaluate our SLIC’s robustness, we defined eight types of editing operations commonly mixed with image tampering: crop, Gaussian blur, median filtering, lightening, sharpening, histogram equalization, affine transform, and JPEG compression. The parameter settings are listed in Table 2.
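The replication padding to a multiple of 32 can be written as a small helper. This is a sketch using NumPy’s edge-replication mode on an (H, W, C) array; the actual implementation may operate on framework tensors instead:

```python
import numpy as np

def pad_to_multiple(img, m=32):
    """Replication-pad an (H, W, C) image so H and W become multiples of m,
    matching the stride requirements of the codecs' down-sampling layers."""
    h, w = img.shape[:2]
    ph = (m - h % m) % m  # rows to add at the bottom
    pw = (m - w % m) % m  # columns to add at the right
    return np.pad(img, ((0, ph), (0, pw), (0, 0)), mode="edge")
```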
4.1. Perceptual Metrics Comparison
Since our adversarial re-compression loss leverages a perceptual metric to divert the perceptual distance between re-compression results, we first tested different perceptual metrics with the Balle2018 codec to observe the adversarial effects. The last two rows of Table 3 show that the traditional metrics MSE and MS-SSIM are unsuccessful in achieving this goal. Their re-compression PSNR values are high, so the visual quality remains indistinguishable to the human eye. During training, the adversarial losses of MSE and MS-SSIM in Figure 3a and the re-compression PSNR in Figure 3b remain flat. We attribute this to MSE being a summation-based metric and to SSIM operating on a small local window with luminance, contrast, and structure components. Minimizing MSE or MS-SSIM for a reconstruction-based neural network is straightforward and efficient; still, these pixel-based metrics cannot provide helpful perceptual features for the neural network to divert the re-compression quality.
Then, we tested the DNN-based perceptual metrics VGGLoss and LPIPS. The VGGLoss uses the intermediate feature activations of a pre-trained VGG network (specifically, layers such as conv1_2, conv2_2, conv3_3, and conv4_3) to compute the perceptual loss, which captures image quality more effectively than simple pixel-wise losses. The LPIPS metric builds on a pre-trained VGG network and adds a linear layer as weights to re-calibrate each feature map’s importance against human-rated ground truth. As Figure 3a shows, using VGGLoss or LPIPS in the adversarial loss is efficient, as the loss clearly decreases through the epochs. We observe a similar loss pattern with a recently developed DNN-based metric, DISTS.
However, the quality result measured by PSNR for VGGLoss differs greatly from that of LPIPS and DISTS: its re-compression quality remains indistinguishable from the first compression result. The re-compression PSNR values of LPIPS and DISTS, shown in Table 3, are significantly lower, so an SLIC trained with the perceptual metric LPIPS or DISTS effectively degrades visual quality. We believe this is because both LPIPS and DISTS learn a weighting layer on top of the VGG feature maps using a human-rated perceptual dataset, so the two metrics weigh more heavily the features that are sensitive to human perception. As a result, a tiny invisible perturbation added to the compressed output will trigger a change in a perceptually sensitive feature map and cause severe quality damage upon re-compression. As for VGGLoss, its feature maps are learned specifically for object recognition, and the magnitude of the feature map coefficients strongly impacts the VGGLoss calculation (L1 or L2 norm). Therefore, we observe that the adversarial loss of VGGLoss decreases (indicating increased quality divergence) during training in Figure 3a, but the PSNR values remain high and nearly unchanged in Figure 3b throughout the training process. That is, an increased VGGLoss distance does not divert the actual perceptual distance between two images.
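The channel re-weighting that distinguishes LPIPS from a plain VGG feature loss can be sketched in a few lines. Here the backbone activations and the learned channel weights are assumed inputs; in the real metric the features come from a pre-trained VGG and the weights are fitted to human-rated judgments:

```python
import numpy as np

def lpips_style_distance(feats_x, feats_y, weights):
    """LPIPS-style distance: unit-normalize each activation across channels,
    square the differences, re-weight channels with learned weights, and
    average spatially, summing over layers.
    feats_x, feats_y: lists of (C, H, W) backbone activations.
    weights: list of (C,) non-negative learned channel weights."""
    d = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-10)
        diff = (fx - fy) ** 2                      # (C, H, W)
        d += (w[:, None, None] * diff).sum(axis=0).mean()
    return d
```

A perturbation concentrated in a heavily weighted channel moves this distance far more than an equal-energy perturbation elsewhere, which is the sensitivity the adversarial loss exploits.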
In addition to the DNN-based perceptual metrics, we also tested the better-developed traditional quality metrics GMSD and NLPD to compare their ability to divert the re-compression quality. The Gradient Magnitude Similarity Deviation (GMSD) [36] is designed to measure perceptual similarity by focusing on gradient information in both the x and y directions, as human vision is highly sensitive to edges and gradient changes in images. The Normalized Laplacian Pyramid Distance (NLPD) [37] is rooted in Laplacian pyramid decomposition, which captures image details at multiple scales and mimics the multi-scale nature of human visual perception. Table 3 shows that both GMSD and NLPD reduce the re-compression PSNR notably compared with MSE and MS-SSIM. However, the adversarial effect of NLPD is not as robust to the various editing operations as that of GMSD, as showcased in Section 4.3. We can think of NLPD as a metric scored by a simplified shallow network with Laplacian filters, in contrast to the VGGLoss from a deep VGG network that learns filter maps from ImageNet. This may explain the lesser effectiveness of NLPD as a perceptual metric for diverting visual quality.
As GMSD computes the gradient magnitude similarity (GMS) between two images and uses the standard deviation of the GMS values to measure their difference, a minor update to the standard deviation of GMS can correspond to regions of significant perceptual difference driven by gradient changes. This is why GMSD is quite effective in diverting visual quality, as the decreasing PSNR in Figure 3b shows.
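For concreteness, the GMS map and its deviation can be computed as below. This sketch omits the 2× average-pooling pre-step of the original GMSD and treats the contrast constant c as an assumed value for images in [0, 255]:

```python
import numpy as np

def _conv3(img, k):
    """Valid 3x3 cross-correlation via shifted sums (NumPy only)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def gmsd(ref, dist, c=170.0):
    """Gradient Magnitude Similarity Deviation between two grayscale images.
    The GMS map compares Prewitt gradient magnitudes pixel-wise; its standard
    deviation is the score (0 = identical gradient structure)."""
    px = np.array([[1.0, 0.0, -1.0]] * 3) / 3.0    # Prewitt kernel, x direction
    py = px.T                                      # Prewitt kernel, y direction
    gm_r = np.hypot(_conv3(ref, px), _conv3(ref, py))
    gm_d = np.hypot(_conv3(dist, px), _conv3(dist, py))
    gms = (2 * gm_r * gm_d + c) / (gm_r**2 + gm_d**2 + c)  # similarity map
    return gms.std()
```

Because the score is a deviation rather than a mean, a small shift in the statistic can reflect localized but perceptually salient gradient damage, matching the stripe artifacts discussed above.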
4.2. Destructive-Compression Effects
We present the qualitative re-compression results in Figure 4. In the second and third rows of Figure 4, the re-compression quality of the SLIC codecs trained with the perceptual metrics LPIPS and DISTS is almost destroyed and unrecognizable relative to the first compressed image, which aligns with the low average PSNR value of around 5 in Table 3. The quality damage introduced by the perceptual metrics LPIPS and DISTS is far more severe than the artifacts caused by other adversarial attacks [15,52]. It is interesting to note that LPIPS and DISTS produce different artifact patterns, probably due to the design nature of each perceptual metric. The re-compression artifacts caused by the LPIPS metric combined with the neural codecs Balle2018, Minnen2018, and Cheng2020 are presented in Figure 5, Figure A2, and Figure A3, respectively. Their artifact patterns are similar except for the overflowed and truncated pixel colors, which should be affected by the randomness of image batches when we fine-tune the neural codec.
The re-compression artifacts caused by the GMSD metric show easily visible, uniform vertical black-and-white stripes, as seen in the fourth row of Figure A1. We think this pattern may come from the image gradients in the x direction. The NLPD metric causes fewer artifacts than GMSD but with similar, thicker vertical red and light-blue stripes. The last row in Figure 4 shows the re-compression result of the VGGLoss metric. VGGLoss causes very few artifacts, mostly near image borders, as in the “48970” image of the FFHQ dataset, which echoes the findings in the previous section: VGGLoss is not an adequate metric for exploiting the neural encoder toward a degraded re-compression result.
We provide more destructive-compression results in Appendix A.2. Pursuing a perfect perceptual metric that fully matches human visual judgment is a holy grail of the image quality assessment field. Most studies focus on minimizing perceptual loss, but almost none have investigated how to effectively divert one perceptual quality from another. This may be an interesting research topic for studying adversarial attacks on learned image codecs.
4.3. Robustness Against Editing Operations
Because an image manipulated for counterfeiting will undergo various distortions, the SLIC must be robust to possible image editing. We present our SLIC’s robustness against the eight pre-defined editing operations in Table 4. We tested the three selected neural codecs with the perceptual metrics LPIPS and DISTS, and the PSNR values were generally below 10 across the tested SLICs. Figure 5 shows the re-compression quality degradation of the Balle2018 + LPIPS codec as an example; more results are listed in Appendix A.2.
The robustness to cropping proves that the adversarial perturbations added to the compressed output are translation-invariant. Our SLIC is not affected by distortions such as sharpening, lighting adjustment, and color adjustment during image tampering, because these edits do not smooth out the adversarial perturbations but rather magnify them. Filtering operations such as Gaussian blur, median filtering, affine transform, and JPEG compression partially filter out the adversarial perturbations; still, the noise attacker simulates this kind of distortion during training. Therefore, our SLIC can still destroy the visual quality after re-compression under challenging distortions such as blurring, rotation, scaling, and JPEG compression. The qualitative results in Figure 5 demonstrate our SLIC’s robustness when used as a secure image codec.
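As an illustration of the attacker-side editing simulation, a few of the listed operations can be sketched as random transforms applied during training. The subset and parameters below (crop ratio, lighten factor, blur sigma) are assumptions for illustration, not the settings in Table 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    r = max(1, int(2 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def random_edit(img):
    """Pick one benign edit at random (subset of the paper's eight operations)."""
    op = rng.integers(3)
    if op == 0:                              # crop: keep a random 75% window
        h, w = img.shape
        y, x = rng.integers(h // 4), rng.integers(w // 4)
        return img[y:y + 3 * h // 4, x:x + 3 * w // 4]
    if op == 1:                              # lighten by 20%, clipped to range
        return np.clip(img * 1.2, 0, 255)
    return gaussian_blur(img)                # Gaussian blur
```

In training, such a transform would sit between the first compressed output and the second encoding pass, so the codec learns perturbations that survive these edits.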
As JPEG re-compression is a well-known traditional attack, we wondered whether our SLIC can also resist re-compression attacks from modern neural image codecs. We conducted a re-compression attack on the SLIC images using the original neural image codecs and present the results in Table 5. As Table 5 shows, the adversarially trained SLICs based on Balle2018 and Minnen2018 are robust to the vanilla neural codecs, except for Cheng2020. Re-compressing an SLIC image with Balle2018 or Minnen2018 leads to a low PSNR value of around 17 with noticeable artifacts. A neural codec with a superior compression rate like Cheng2020 will filter out the adversarial perturbations in SLIC images and invalidate the security protection. However, if an SLIC is trained with a high-compression-rate codec, e.g., a Cheng2020-based SLIC, its adversarial perturbations resist all the other neural codecs, as indicated in the last row of Table 5.
Our preliminary results on neural codec re-compression robustness are encouraging, as we did not simulate the neural re-compression attack during training. Future work could consider incorporating neural compressor attacks into the attacker network for a more robust SLIC.
4.4. Robustness Against GenAI
In the era of AI-generated content, generative AI (GenAI) tools can manipulate images conveniently and provide more realistic outputs than traditional image editing software. For deepfakes, research efforts like FaceShifter [61] have developed methods to transfer a target face onto a victim’s image, producing highly realistic results. Online tools such as Remaker AI [62] provide vivid face swap tools that are freely available to the public. We tested our SLIC with Remaker AI face swap and Stable Diffusion Inpaint [63] to validate its robustness against GenAI tools.
Figure 6 demonstrates that our SLIC can still damage the image quality of a GenAI-manipulated image after re-compression. We provide two face-swap results and two inpainting results in Figure 6. After re-compression, only the implanted regions with the target face or generated background are preserved; the remaining areas are destroyed by adversarial artifacts. The last column of Figure 6 shows our SLIC’s robustness against compound editing, in which an image is rotated, scaled, and then inpainted. If an image is encoded in the SLIC format, the whole image is protected unless the tampering transplants a large portion of the region; in that case, almost all the essential information in the protected image is lost, which also means the integrity of the image is protected.
However, if the SLIC-encoded image contains a face we want to protect, the adversarial effect is lost after the victim’s face is re-generated onto the target image. In Figure 7, the tampered image in the third column combines a non-SLIC-encoded image with an SLIC-encoded face. The adversarial perturbations on the victim’s face are transformed by the generative model’s encoder into a latent-space representation and then re-generated, aligned, and blended onto the target image. Face swapping thus eliminates the adversarial perturbations, so the re-compression result remains high-quality. Preserving the adversarial perturbations through the face re-generation process would be an essential research direction to explore.