We report the results of our experimental campaign on synthetic and real face images from different sources, employing different metrics and feature representations for the joint analysis of the input images and their reconstructed counterparts.
In particular, we considered the image data employed in the work [
2], which have been made available by the authors (
https://osf.io/ru36d/ (accessed on 30 December 2022)). They include synthetic images generated through StyleGAN2 (indicated as
SG2), real images extracted from FFHQ (indicated as
FFHQ), the high-quality image dataset of human faces used for training StyleGAN2 (
https://github.com/NVlabs/ffhq-dataset (accessed on 30 December 2022)).
Moreover, to diversify the data corpus and test generalization capabilities, we considered additional sets of real images coming from different sources, in particular:
For each image, we applied the inversion process and obtained its reconstructed version; we used the default parameters of the inversion algorithms and fixed the random seed for reproducibility. As a pre-processing step, before inversion we detected the square area containing the face through the
https://github.com/davisking/dlib (accessed on 30 December 2022) library, and blurred the background outside that area so that the input data retain mostly face information. If needed, we resized the area (using the
https://pillow.readthedocs.io/en/stable/ (accessed on 30 December 2022) library) to 1024 × 1024 pixels, the resolution accepted by the inversion algorithm. The average time to reconstruct an input image on an NVIDIA RTX 3090 GPU is 90 s.
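As a reference, the following is a minimal sketch of this pre-processing step, assuming dlib's default frontal-face detector and Pillow; the blur radius and the handling of the detected box are illustrative choices, not taken from the original implementation.

```python
import dlib
import numpy as np
from PIL import Image, ImageFilter

def preprocess_face(path, out_size=1024, blur_radius=15):
    """Detect the square face area, blur the background, and resize for inversion."""
    img = Image.open(path).convert("RGB")
    detector = dlib.get_frontal_face_detector()  # default HOG-based detector
    faces = detector(np.array(img), 1)
    if not faces:
        raise RuntimeError(f"No face detected in {path}")
    box = faces[0]  # keep the first detected face

    # Blur the whole image, then paste the sharp face region back in place
    blurred = img.filter(ImageFilter.GaussianBlur(blur_radius))
    region = (box.left(), box.top(), box.right(), box.bottom())
    blurred.paste(img.crop(region), region)

    # Resize to the fixed resolution accepted by the inversion algorithm
    return blurred.resize((out_size, out_size), resample=Image.NEAREST)
```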
From visual inspection, it can be noticed that the face attributes of the synthetic image are reconstructed very accurately, while more pronounced discrepancies in the biometric traits appear for real images. In the following, we jointly study the face images given as input to the inversion process and their reconstructed counterparts.
4.1. Metrics-Based Analysis
First, we perform a similarity analysis between each image and its reconstructed counterpart, considering the following metrics:
Mean Squared Error (MSE): computes the pixel-wise distance between the two images
$$\mathrm{MSE}(x, \hat{x}) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left\lVert x(i,j) - \hat{x}(i,j) \right\rVert^2,$$
where $M$, $N$ are the dimensions of the images (equal for both), $x$ is the RGB input image, and $\hat{x}$ its reconstructed version.
Structural Similarity Index Measure (SSIM): a perception-based model that takes into account the mean values and the variances of the two images
$$\mathrm{SSIM}(x, \hat{x}) = \frac{(2\mu_x \mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)},$$
with $\mu_x$, $\mu_{\hat{x}}$ being the mean values of the two images, $\sigma_x^2$, $\sigma_{\hat{x}}^2$ their variances, $\sigma_{x\hat{x}}$ their covariance, and $c_1$, $c_2$ being stabilization factors.
Learned Perceptual Image Patch Similarity (LPIPS) (
https://github.com/richzhang/PerceptualSimilarity (accessed on 30 December 2022)): it is proposed in [
31] and used in [
4] for the same purpose; it computes the similarity between the activations of two image patches in a pre-defined network. A low LPIPS score means that the image patches are perceptually similar.
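For reference, the three metrics can be computed as in the sketch below, which assumes NumPy, scikit-image (for SSIM), and the lpips package from the repository cited above; the original work does not state which implementations were used for MSE and SSIM.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_net = lpips.LPIPS(net="alex")  # compares activations of a pre-defined (AlexNet) backbone

def to_lpips_tensor(img):
    """HxWx3 uint8 array -> 1x3xHxW float tensor in [-1, 1], as expected by lpips."""
    t = torch.from_numpy(img).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def similarity_metrics(x, x_rec):
    """x, x_rec: RGB uint8 arrays of identical shape (input image and reconstruction)."""
    mse = np.mean((x.astype(np.float64) - x_rec.astype(np.float64)) ** 2)
    ssim = structural_similarity(x, x_rec, channel_axis=2, data_range=255)
    with torch.no_grad():
        lp = lpips_net(to_lpips_tensor(x), to_lpips_tensor(x_rec)).item()
    return mse, ssim, lp
```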
The results are reported in
Figure 7. We can notice that the SSIM histograms do not show a clear separation among the different clusters. Indeed, SSIM is sensitive to perceivable changes in structural information, which are usually not noticeable in GAN-generated images. On the contrary, we observe that pairs deriving from real images generally yield higher MSEs than the ones derived from synthetic faces (red histogram), making it evident that reconstructing a pixel-wise identical image of an unknown target is much more difficult. The same happens for the LPIPS metric, where, following what was observed in [
4], the
SG2 images yield a higher similarity with their reconstructed counterparts.
Moreover, real images belonging to different datasets lead to different distributions, both in terms of LPIPS and MSE. In particular, it is interesting to observe that FFHQ images (blue histogram) present significantly lower values than the other sources of real images: this may be related to the fact that those images were included in the training set of StyleGAN2 and are thus known to the generator.
4.2. Classification Results
We now report the results of the classification analysis performed according to the pipeline proposed in
Figure 4. First, we plot the histogram of the MSE between the FaceNet embeddings of each input image and those of its reconstruction among the different datasets. In
Figure 8, we observe that both FaceNet embeddings improve the discrimination capability already observed in
Figure 7. In this representation, the images generated with SG2 produce a clear peak around low MSE values, with a rather limited overlap with the other clusters. As expected, the
FFHQ image pairs lie between synthetic samples and other real ones.
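As an illustration of how the FaceNet-based differential representation can be extracted, the sketch below relies on the facenet-pytorch package (InceptionResnetV1 pre-trained on VGGFace2); the FaceNet implementation actually used in the paper may differ, so the details should be treated as assumptions.

```python
import numpy as np
import torch
from PIL import Image
from facenet_pytorch import InceptionResnetV1

facenet = InceptionResnetV1(pretrained="vggface2").eval()  # 512-dimensional embeddings

def fn_embedding(path, size=160):
    """FaceNet embedding of a face image as a 1D numpy array."""
    img = Image.open(path).convert("RGB").resize((size, size))
    t = torch.from_numpy(np.array(img)).permute(2, 0, 1).float()
    t = (t - 127.5) / 128.0  # standard FaceNet input normalization
    with torch.no_grad():
        return facenet(t.unsqueeze(0)).squeeze(0).numpy()

def fn_differential(input_path, reconstructed_path):
    """Element-wise difference between the embeddings of an image and its reconstruction."""
    return fn_embedding(input_path) - fn_embedding(reconstructed_path)
```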
The charts reported in
Figure 9 propose an analysis of the individual subsets of landmarks, grouped according to the face area to which they belong (see
Figure 5). These plots allow us to grasp the importance and the specific contribution of different sets of landmarks corresponding to different areas in the face. Indeed, the face line and eyes areas (eyes and eyebrows landmarks) clearly highlight the differences between real and synthetic pairs (see
Figure 9b–d), while the nose and mouth areas are less effective in discrimination (
Figure 9e–f). In any case, the whole set of landmarks leads to the strongest separation and is therefore used for further analysis.
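The grouping of landmarks by face area can be sketched as follows, assuming dlib's standard 68-point shape predictor; the index ranges follow the usual 68-point annotation scheme, while their exact mapping to the groups of Figure 9 is our assumption.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard 68-point model file, assumed to be downloaded separately
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Index ranges of the face areas in the standard 68-point annotation
FACE_AREAS = {
    "face_line": range(0, 17),
    "eyebrows": range(17, 27),
    "nose": range(27, 36),
    "eyes": range(36, 48),
    "mouth": range(48, 68),
}

def landmarks(img):
    """img: RGB uint8 array -> 68x2 array of (x, y) landmark coordinates."""
    rect = detector(img, 1)[0]
    shape = predictor(img, rect)
    return np.array([[p.x, p.y] for p in shape.parts()])

def lm_differential(img, img_rec, area=None):
    """Landmark displacements between an image and its reconstruction,
    optionally restricted to a single face area, flattened into a vector."""
    d = landmarks(img) - landmarks(img_rec)
    if area is not None:
        d = d[list(FACE_AREAS[area])]
    return d.ravel()
```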
The UMAP visualization of the
FN and
LM differential features (
Figure 10) provides a 2D view of the distribution of the different pairs: pairs deriving from real images of different sources essentially overlap, whereas real and synthetic pairs tend to form clearly separated clusters.
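A minimal sketch of the 2D projection of Figure 10, assuming the umap-learn package with default settings (the projection parameters used in the paper are not specified):

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_umap(features, labels):
    """features: NxD differential vectors; labels: N source names (e.g., 'SG2', 'FFHQ')."""
    emb = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
    labels = np.asarray(labels)
    for name in np.unique(labels):
        idx = labels == name
        plt.scatter(emb[idx, 0], emb[idx, 1], s=5, label=name)
    plt.legend()
    plt.show()
```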
The results obtained with the different types of differential vectors are reported in
Table 2a–d. We first tested the datasets of real images individually against the
SG2 data (indicated as
* vs.
SG2), as well as their union (indicated as
All vs.
SG2), yielding six different settings reported row-wise in the tables.
We observe that the SVM provides, on average, better results, at a limited computational cost. In addition, we verified that the Radial Basis Function (RBF) kernel, with a suitably tuned hyperparameter, consistently yields the best performance; thus, we select it as the reference model for the following experimental analyses. In terms of computational efficiency, the training time of the SVM models is in the order of milliseconds and is therefore negligible with respect to the inversion time.
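A sketch of the classification step using scikit-learn (not named in the paper) is given below; since the RBF hyperparameter values are not reported here, they are selected by cross-validated grid search as an illustrative choice.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_rbf_svm(features, labels):
    """features: NxD differential vectors; labels: 0 = real pair, 1 = synthetic (SG2) pair."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0)
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(model, {"svc__C": [0.1, 1, 10, 100]}, cv=5)
    grid.fit(X_tr, y_tr)
    print("held-out accuracy:", grid.score(X_te, y_te))
    return grid.best_estimator_
```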
Interestingly, the landmark-based differential features yield substantially higher discrimination capabilities than the FaceNet-based ones, despite their generally lower dimensionality. In particular, they perform exceptionally well on the
LFW pairs, which are the most critical case for the FaceNet representations. For a better understanding, we report in
Figure 11 a comparative visualization of the landmarks detected in the input and reconstructed faces, and we plot them together to visualize the misalignment. It can be seen that the synthetic StyleGAN2 image pairs present almost overlapping landmarks, while the real ones show irregular displacements of individual landmarks.
In general, the
FFHQ pairs seem to be the hardest ones to distinguish from
SG2 pairs also for landmark-based features, possibly because the former were employed for training the StyleGAN2 generator. This is also observed in
Figure 12 and
Figure 13, where the ROC curves of the different classification scenarios and feature representations are reported. Even if the results are very good in all cases,
Table 3 shows that the AUC values for the landmark-based (LM) features remain lower for the
FFHQ data with respect to other real images.
4.3. Robustness Analysis
An aspect of high practical relevance is whether synthetic images can still be identified through the inversion-based analysis even when they are not the direct output of the generator but undergo subsequent post-processing. An advantage of facial landmarks is that their detection and localization are rather robust to the operations applied to the images under analysis. FaceNet embeddings are also designed to generalize across different face image scales and conditions.
Since handling the variety of (even slight) potential operations is a known issue for data-driven techniques based on learned features, we now assess the robustness of the classifiers developed in
Section 4.2 when training and testing data are not aligned in terms of post-processing. In this view, we study three routinely applied operations in the lifecycle of digital images, namely resizing, JPEG compression, and social network sharing. For the sake of conciseness, we focus on the case of
FFHQ vs.
SG2, which is the most critical one for the best-performing features.
4.3.1. Resizing
We have tested our discrimination models with inputs at different resolution levels by downscaling and upscaling the images. In particular, we rescale the entire datasets at different scaling factors, apply the inversion/reconstruction process for each case, and finally compute the FaceNet-based and landmark-based differential feature vectors. The library used for the resize process is
https://pillow.readthedocs.io/en/stable/ (accessed on 30 December 2022), with the nearest neighbor resample parameter set (
PIL.Image.NEAREST in the code). Since the StyleGAN2 inversion algorithm requires inputs with fixed size 1024 × 1024 pixels, when inverting images of different resolutions, a further cropping/rescaling operation is needed to meet this requirement. In doing so, some details of the image change in terms of quality (see
Figure 14), making the discrimination between real and fake images harder in principle.
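The rescaling and the subsequent adjustment to the fixed input size of the inversion algorithm can be sketched as follows; the center-crop/upscale strategy is an illustrative choice, as the text only states that a further cropping/rescaling is applied.

```python
from PIL import Image

def rescale(img, factor):
    """Rescale a PIL image by the given factor with nearest-neighbor resampling."""
    w, h = img.size
    return img.resize((int(w * factor), int(h * factor)), resample=Image.NEAREST)

def fit_for_inversion(img, size=1024):
    """Bring a rescaled image back to the fixed 1024x1024 input size:
    center-crop if larger, upscale if smaller (illustrative choice)."""
    w, h = img.size
    if w >= size and h >= size:
        left, top = (w - size) // 2, (h - size) // 2
        return img.crop((left, top, left + size, top + size))
    return img.resize((size, size), resample=Image.NEAREST)
```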
After having scaled all the images, we trained and tested the models with these resized examples.
Table 4,
Table 5,
Table 6 and
Table 7 report the results obtained, where the training scaling factors are reported row-wise and the testing scaling factors column-wise. Rows and columns with a scaling factor of 1 correspond to the baseline case (no scaling), where no post-processing is applied to either training or testing images. We notice that the accuracies are generally preserved and present no dramatic drops, but rather oscillations around the aligned cases corresponding to the diagonal values. For the majority of classifiers, the average variation over different testing sets does not exceed 2%.
Among the different representations, the FaceNet-based features seem to struggle more with the upscaled images than with the downscaled ones, occasionally dropping below 80% accuracy. This behavior is reversed for the landmark-based features, whose performance is more stable for upscaling factors and more sensitive and oscillatory for downscaling ones. In both cases, the dimensionality of the differential vectors does not have a significant impact.
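The train/test mismatch evaluation behind the tables above can be reproduced along the lines of the sketch below, assuming that the differential vectors have already been extracted for every scaling factor; the classifier setup mirrors the SVM sketch of Section 4.2.

```python
import numpy as np
from sklearn.svm import SVC

def cross_factor_accuracy(train_sets, test_sets):
    """train_sets/test_sets: dicts mapping a scaling factor to an (X, y) pair.
    Returns a matrix of accuracies: rows = training factor, columns = testing factor."""
    factors = sorted(train_sets)
    acc = np.zeros((len(factors), len(factors)))
    for i, f_tr in enumerate(factors):
        X_tr, y_tr = train_sets[f_tr]
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)
        for j, f_te in enumerate(factors):
            X_te, y_te = test_sets[f_te]
            acc[i, j] = clf.score(X_te, y_te)
    return acc
```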
4.3.2. JPEG Compression
We apply the same robustness analysis for JPEG compression at different quality factors. Examples of compressed images and their reconstructions are reported in
Figure 15.
The results are reported in
Table 8,
Table 9,
Table 10 and
Table 11. As in the resizing analysis, the library used for the compression process is
https://pillow.readthedocs.io/en/stable/ (accessed on 30 December 2022). We varied the quality parameter of the saved image. The baseline case is reported in the 'NO COMP' rows and columns. Also in this case, all of the feature representations generally retain their accuracies when the training and testing sets are misaligned, as most of the models have an average deviation from the aligned case below 2%.
Interestingly, when observing the results column-wise, we notice that for the FaceNet-based features a stronger JPEG compression consistently degrades the average performance of the classifiers; in contrast, the landmark-based features fully retain their accuracy, strengthening the observation that such semantic cues yield improved robustness to post-processing.
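The compression step itself can be sketched as follows with Pillow, compressing in memory and reloading the image; the quality factors and the input path are placeholders, since the exact values tested are not listed here.

```python
from io import BytesIO
from PIL import Image

def jpeg_compress(img, quality):
    """Compress a PIL image in memory at the given JPEG quality factor and reload it."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Placeholder quality factors and input path, for illustration only
qualities = (90, 70, 50)
compressed = {q: jpeg_compress(Image.open("face.png"), q) for q in qualities}
```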
4.3.3. Social Network Sharing
The identification of synthetic media over social networks is a well-known challenge in media forensics [
32], and an open issue for GAN-generated image detection [
33]. Social media typically apply custom data compression algorithms to reduce the size and the quality of the images stored on data centers or customers' devices, thus hindering post hoc analyses.
We then test the capability of the developed classifiers to generalize to image data shared on social networks. It is worth noticing that, in this case, the models are exactly the classifiers developed in
Section 4.2; thus, they are trained entirely on images with no sharing operations. Instead, the testing set is a subset of the recently published
https://zenodo.org/record/7065064#.Y2to3ZzMKdZ (accessed on 30 December 2022) dataset, which is composed of StyleGAN2 images and real images (extracted from FFHQ) before and after being uploaded to and downloaded from three different social networks: Facebook, Telegram, and Twitter. We randomly select 100 synthetic and 100 real images and extract their shared versions from the three platforms.
Examples of shared images and their reconstructions are reported in
Figure 16.
If needed, a rescaling operation is applied to fit the input size of the inversion algorithm; after inversion, we obtain a reconstruction for each image. We then extract the feature representations and differential vectors and test them with the corresponding classifiers already trained in
Section 4.2 for the
FFHQ vs.
SG2 scenario.
The results are reported in
Table 12, where the accuracies of each binary classification scenario (one for each platform) are reported column-wise. As highlighted in [
33], all platforms apply JPEG compression (quality factor between 80 and 90), and Facebook also resizes the images by a 0.7 factor.
Interestingly, the landmark-based features yield remarkable performance in all cases, thus demonstrating high robustness against this realistic kind of post-processing. In particular, they achieve the maximum accuracy in the Facebook case, which is the most critical one for the FaceNet-based features, as well as for the general-purpose deep networks analyzed in [
33].