1. Introduction
The remarkable progress in deep learning, specifically in Generative Adversarial Networks (GAN) [
1], in recent times has enabled the generation of synthetic images of such high quality that they can often be mistaken for genuine images [
2]. While this advancement opens up exciting opportunities across various industries, it simultaneously raises significant ethical issues due to the possibility of misuse. Such misapplications could have profound societal impacts, including disseminating false information, engaging in fraud, committing identity theft, and generating harmful or inappropriate content [
3]. As deep learning evolves at an unparalleled pace and AI-generated images become more lifelike by the day, establishing robust techniques to differentiate between real and AI-created content becomes imperative. This is essential to counteract the adverse effects and maintain the trustworthiness and integrity of online information.
Aside from visual inspection, which is an unreliable way of identifying fake images today, there are two main approaches for detecting them: hand-crafted feature extraction using image processing techniques and deep learning [
2]. Early methods for detecting fake images resulting from image tampering aimed to identify a specific tampering technique, such as splicing or copy–move [
4]. These tampering techniques typically involve altering certain parts of an image to create or modify objects within it. Early fake-detection approaches were usually based on extracting features from the frequency domain. For example, the approach proposed in [
5] divides the image into overlapping blocks and detects the copy–move forgery by matching the features extracted from these blocks using discrete cosine transformation. Similarly, the method proposed in [
6] for feature extraction from the frequency domain obtains low-frequency components using the discrete wavelet transformation, and singular value decomposition is then applied to these components to obtain the feature vectors. However, this method is time-consuming and sensitive to image objects that are rotated, scaled, or blurred. To improve the efficiency of the previous method, the same author proposed in [7] the usage of the Fourier–Mellin transformation. The transformation mitigates sensitivity to geometric operations, while the whole detection process is accelerated using a Bloom filter. Nevertheless, when faced with image splicing, the mentioned methods perform poorly, as splicing involves integrating segments from different sources, each with distinct textures and features. The method proposed in [
8] extracts the color filter array pattern from the image. The statistical analysis of local inconsistencies within these patterns is then performed to differentiate between original and fake regions. In [
9], the classification model is used to identify irregularities introduced by splicing in images, thereby identifying the fake ones. Feature extraction is performed by a discrete cosine transformation, as splicing often disrupts the frequency patterns of the original image, while Markov features are derived from obtained transform coefficients.
However, with the development of sophisticated GANs, images are often altered using multiple tampering techniques simultaneously, creating more realistic-looking images that are difficult to identify as manipulated, making it challenging to pinpoint both the nature of the tampering and the specific regions affected in the image. Thus, the previously effective methods based purely on feature extraction can no longer accurately detect these manipulations [
10]. This can be overcome with deep learning approaches such as convolutional neural networks (CNNs). CNNs are built around various architectures of deep artificial neural networks featuring convolutional and pooling layers followed by several hidden layers of neurons. These layers progressively extract higher-level features from the raw input, such as images. By increasing the number of layers, these methods can approximate more intricate decision functions, thereby attaining superior classification accuracy [
11,
12]. Nevertheless, today, high-performing CNNs, such as VGGNet [
13], DenseNet [
14], and ResNet [
15], consist of a large number of layers, which increases the need for training data and the complexity of the training procedure due to the presence of multiple local optima and an extensive number of hyperparameters. Additionally, they are considered black-box function approximators that do not allow for an explanation of their decisions [
16]. For example, VGGNet consists of 16 layers, including 13 convolutional layers and 3 fully connected layers with several thousands of neurons, while ResNet and DenseNet are more than 100 layers deep. In contrast, a lightweight CNN for image forgery detection was proposed in [
17], which includes a standard 3 × 3 convolutional layer followed by 17 bottleneck layers of 1 × 1 and 3 × 3 convolutions. Compared to a standard convolution operating across a large kernel, using small kernels such as 1 × 1 enables higher computational efficiency. Another lightweight CNN with 3 convolutional layers was proposed in [
18], with 32, 64, and 128 filters, respectively, and a 3 × 3 kernel size for forgery detection in images. However, compared to deep CNNs, current lightweight methods cannot extract high-quality features from input images due to their small number of layers and small kernel sizes, and thus cannot achieve high classification performance.
Central to our methodology is a deep learning approach based on CNN architecture. Several important scientific contributions with multidisciplinary and social implications arise from this study. These contributions include the following:
A high-performance CNN-based method, consisting of only eight convolutional and two hidden layers, that enhances the human ability to identify AI-generated images using computer vision. The method was evaluated on two benchmark datasets as well as on custom-generated data based on Sentinel-2 imagery;
An approach for creating high-quality synthetic visual content through transfer learning based on the publicly available StyleGAN framework [
19]. Specifically, we generated synthetic satellite images based on Sentinel-2 imagery for this study;
The utilization of Explainable AI (XAI) to deepen our understanding of the intricate processes involved in synthetic image recognition.
The mentioned scientific contributions represent a significant leap in addressing the intricate challenges arising from the rapid development of modern technology. This progress holds vital implications for upholding the trustworthiness and accuracy of data, which is crucial in an era heavily reliant on digital information.
The rest of the paper is structured as follows.
Section 2 provides background on synthetic image generation and detection algorithms. In the most research-intensive
Section 3, we first summarize the creation of synthetic data using StyleGAN and then describe the proposed CNN architecture. In
Section 4, we present and discuss the results, together with the interpretation of the resulting model. Finally,
Section 5 concludes the paper.
3. Materials and Methods
In this section, we present a CNN-based method that consists of eight convolutional and two hidden layers. We begin with data preparation, followed by a detailed explanation of how a synthetic dataset for fake image detection, containing both generated and real satellite images, is created using a GAN architecture, along with the training details and a quantitative evaluation of the generated image quality using standard metrics.
3.1. Data Preparation
For this study, we have created a new dataset specially optimized for GAN training. Firstly, we collected images portraying various locations from across the globe, categorized into cities, coastal areas, deserts, lowlands, and tundras from the Copernicus Data Space Ecosystem [
43].
Figure 1 displays sample satellite images for each category. This diversity ensures that the GAN learns to generate a wide range of synthetic satellite images with different features of terrain, vegetation, or human settlements.
We extracted slices of size 256 × 256 pixels from satellite images in GeoTIFF format. Each slice represents a separate and unique geographic entity within a larger satellite image, which increases the variety and amount of data available for training a GAN, as many training images are obtained from a single original satellite image (see
Figure 2).
After acquiring the slices, it is crucial to carefully review them and select the proper ones for GAN training. Satellite images often contain disturbances such as distortion, unwanted objects, clouds, inadequate lighting, and other factors that impair image quality, and including such images can cause the model to learn to reproduce these disturbances instead of focusing on the key features and details that need to be generated. After this cleaning step, we retained 50,000 slices of satellite images in PNG format. Additionally, to increase the dataset’s scope, 90° rotations and vertical and horizontal mirroring were applied to the collected images. Several examples of such transformations are shown in
Figure 3.
Overall, the GAN training dataset consisted of 100,000 images, of which 50,000 were collected, and 50,000 were created by transformations of the collected images, thus ensuring a sufficient amount of diverse data for efficient learning and generating new satellite images.
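For illustration, a minimal sketch of this tiling and augmentation step is given below. The library choices (rasterio and Pillow), the non-overlapping tiling scheme, and the assumption that the bands have already been scaled to 8-bit RGB are ours for illustration and may differ from the scripts actually used.

# Illustrative sketch (not the original pipeline): tiling a GeoTIFF scene into
# 256 x 256 slices and augmenting cleaned slices with 90-degree rotation and
# mirroring. Assumes bands are already scaled to 8-bit RGB.
import rasterio
import numpy as np
from PIL import Image

TILE = 256

def extract_slices(geotiff_path, out_prefix):
    """Cut a satellite scene into non-overlapping 256 x 256 RGB tiles."""
    with rasterio.open(geotiff_path) as src:
        img = src.read([1, 2, 3])                  # first three bands as RGB
    img = np.transpose(img, (1, 2, 0))             # (bands, H, W) -> (H, W, bands)
    h, w = img.shape[:2]
    count = 0
    for y in range(0, h - TILE + 1, TILE):
        for x in range(0, w - TILE + 1, TILE):
            tile = img[y:y + TILE, x:x + TILE]
            Image.fromarray(tile.astype(np.uint8)).save(f"{out_prefix}_{count}.png")
            count += 1
    return count

def augment(png_path, out_prefix):
    """Create rotated and mirrored variants of a cleaned slice."""
    img = Image.open(png_path)
    img.rotate(90, expand=True).save(f"{out_prefix}_rot90.png")
    img.transpose(Image.FLIP_LEFT_RIGHT).save(f"{out_prefix}_hflip.png")
    img.transpose(Image.FLIP_TOP_BOTTOM).save(f"{out_prefix}_vflip.png")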
3.2. Synthetic Data Generation
In order to generate synthetic satellite images, we rely on the StyleGANv3 architecture, an advanced version of the StyleGAN framework created and released by NVIDIA [21]. StyleGANv3 is recognized for its remarkable capacity to produce high-quality images. The remainder of this subsection presents the step-by-step procedure for setting up and training a GAN model using the images collected and processed as described in the previous subsection.
The training of a GAN can be demanding and time-consuming, as it depends on the complexity and size of the dataset and the available computing resources. During training, unwanted features can appear in the generated images, usually due to insufficient data quality, poor neural network architecture, or inappropriate hyperparameters such as learning rate, batch size, and the number of training epochs. One of the widespread challenges during the training of GANs is mode collapse. Mode collapse is a problem wherein the generator begins to generate images with limited variety, focusing only on certain features of the training data rather than representing a wide range of outputs that match the actual data. The generator has a more demanding task, as it has to learn how to turn random noise into meaningful images that can fool the discriminator. On the other hand, the discriminator has a more straightforward role of distinguishing actual images from synthetic ones, which is why it is trained faster and usually starts to dominate the generator.
Given that training a GAN can take days, weeks, or even months, it is often inefficient to train a model from scratch. In order to speed up training, a transfer learning technique can be applied. This implies using a previously trained model as a starting point for a new task. Such models have already learned useful features and representations that are necessary for image generation, which can significantly shorten training time and enable the faster achievement of quality results. The publicly available StyleGANv3 model, trained on diverse landscapes, was obtained from [
44]. Landscapes share many similarities with collected satellite images, such as different terrain shapes, colors, and textures, which allow previously learned model features, such as edge detection and natural pattern recognition, to be successfully applied to satellite images during further training.
After applying the transfer learning technique, the discriminator gained too much dominance over the generator, causing the balance between them to collapse. In [
45], the authors point out that the training process is considered complete when the generator generates images that are so realistic that the discriminator cannot reliably distinguish them from the real ones better than random guessing. This balance was achieved when the batch size was set to 32 samples, and the learning rate was set to 0.001 for the generator and 0.0001 for the discriminator to give the generator a slight advantage.
The StyleGANv3 model uses two additional parameters, ticks and kimgs, for transfer learning. Kimgs represents the number of kilo images per tick, where each tick corresponds to a fixed number of training iterations. In our case, the StyleGANv3 model was trained for 900 ticks, with four kimgs per tick, which amounts to approximately 3.6 million images processed during the training phase. Given that the training dataset consisted of 100,000 unique images, it can be inferred that each image was used an average of 36 times during the training process. Next, we present the assessment of the results of StyleGANv3 transfer learning in the form of generated image quality and the losses of generator and discriminator.
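As a rough illustration only, the sketch below shows how such a transfer-learning run might be launched with the publicly released StyleGANv3 train.py, using the hyperparameters reported above. The flag names follow the public NVIDIA repository and should be verified against the exact version used; all paths, the configuration variant, and the R1 regularization weight (which the paper does not report) are placeholders.

# Illustrative launch of StyleGANv3 transfer learning (batch 32, G lr 0.001,
# D lr 0.0001, 3600 kimg total, 4 kimg per tick). Flag names follow the public
# NVlabs stylegan3 repository; paths and the --gamma value are placeholders.
import subprocess

cmd = [
    "python", "train.py",
    "--outdir=training-runs",
    "--cfg=stylegan3-t",                  # assumed configuration variant
    "--data=datasets/sentinel2-256x256.zip",
    "--gpus=1",
    "--batch=32",
    "--gamma=2",                          # R1 weight; value not reported in the paper
    "--glr=0.001",                        # generator learning rate
    "--dlr=0.0001",                       # discriminator learning rate
    "--kimg=3600",                        # total thousands of images shown
    "--tick=4",                           # kimg per tick
    "--resume=pretrained/landscapes.pkl", # transfer-learning starting point
]
subprocess.run(cmd, check=True)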
3.2.1. Generated Image Quality
While training a GAN, monitoring the progress and quality of the generated images is critical. In addition to visual inspection by the human eye, specific metrics can be used to evaluate the quality of GAN-generated images. For this purpose, we used Fréchet Inception Distance (FID) [
46] for assessing the quality of images produced by the model. It is a valuable tool for determining the similarity between generated and real images. FID measures the similarity between the features of generated and actual images, and it is defined as
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right), \qquad (1)$$
where $\mu_r$ and $\Sigma_r$ are the mean and covariance of the features extracted from the real images, and $\mu_g$ and $\Sigma_g$ are the mean and covariance of the features extracted from the generated images. Therefore, FID compares the means and covariances of the extracted feature distributions, while lower FID scores indicate that the generated images are closer to the actual images.
We follow the feature extraction procedure described in [
47]. The FID calculation begins with extracting features from generated and actual images using a pre-trained Inception v3 network, i.e., a convolutional neural network intended for image classification proposed in [
48]. Then, the FID values are calculated based on extracted features according to Equation (
1).
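To make the computation concrete, a minimal sketch of Equation (1) is given below, assuming that the Inception v3 features have already been extracted into two NumPy arrays; it is an illustration rather than the exact implementation used to produce Figure 4.

# Minimal FID sketch per Equation (1); `real_feats` and `gen_feats` are
# (N, 2048) arrays of Inception v3 features for real and generated images.
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop tiny numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))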
In
Figure 4, the graph illustrates the fluctuation of the FID throughout the StyleGAN training process. The x-axis represents
kimgs, ranging from 0 to 3600, while the y-axis denotes the FID value. At the onset of training, the FID exhibits a notably high value of approximately 170, indicative of significant dissimilarities between the generated images and those in the dataset. Notably, applying transfer learning techniques leads to a steep decline in FID during the initial training phases, signifying rapid enhancement in the quality of the generated images. Subsequently, as the training progresses, the rate of decline in FID attenuates, exhibiting a consistent, gradual reduction with minimal oscillation. Ultimately, upon completing the 3600
kimgs training, the FID stabilizes at a value of less than 25, signaling a substantial improvement in the fidelity of the generated images.
3.2.2. Generator and Discriminator Losses
The roles of generator and discriminator losses are pivotal in the training of GANs. The generator loss measures its ability to deceive the discriminator and generate authentic-looking images. The objective is to maximize the likelihood of the discriminator classifying the generated images as real while minimizing the generator loss throughout the training process. The losses of StyleGANv3 for discriminator and generator are defined in [
21]. Conversely, the discriminator loss signifies the discriminator’s aptitude to differentiate between real and generated images. The discriminator aims to minimize its loss by accurately classifying real images as real and generated images as generated. A low overall loss for the discriminator therefore indicates its successful discrimination of real and generated images.
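For reference, a simplified sketch of the non-saturating logistic losses used by the StyleGAN family [21] is shown below; the regularization terms applied in the full implementation, such as the R1 penalty, are omitted for brevity.

# Simplified non-saturating logistic GAN losses (StyleGAN family);
# R1 and other regularization terms are omitted for brevity.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits: torch.Tensor) -> torch.Tensor:
    # The generator tries to make the discriminator output high logits
    # for generated images.
    return F.softplus(-d_fake_logits).mean()

def discriminator_loss(d_real_logits: torch.Tensor,
                       d_fake_logits: torch.Tensor) -> torch.Tensor:
    # The discriminator tries to assign high logits to real images and
    # low logits to generated ones.
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()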
In
Figure 5, the fluctuations in loss values for both the generator (represented by the green line) and the discriminator (illustrated by the red line) during the training of the model are visualized. The
x-axis denotes the number of
ticks from 0 to 900, while the
y-axis represents the corresponding loss values. At the outset of the training, both the generator’s and the discriminator’s loss values are notably high and exhibit oscillatory behavior. However, as the training progresses, the losses for both components gradually diminish and stabilize. Although oscillations persist, their amplitude decreases. No singular dominance between the two components is evident. It is worth noting that the discriminator consistently exhibits lower loss values compared to the generator, which is a common characteristic in GANs, reflecting the initial ease with which the discriminator can distinguish real from generated images.
3.3. Proposed Method
Here, we propose a new CNN architecture for synthetic image detection. The architecture combines two principal components through a series of intermediate operations: the convolutional layers, which are responsible for the initial processing and feature extraction from the images, and the fully connected layers, which perform the classification based on the extracted features. This framework captures the intricacies and subtleties present within images, thereby facilitating a more nuanced and accurate classification outcome.
The architecture of the proposed CNN model is structured with multiple layers, each designed to methodically identify and analyze features from the input images, ultimately leading to a fully connected layer for the classification task. This architecture is displayed in
Figure 6.
The input layer accepts images of dimensions 3 × H × W, indicating three color channels across height and width. Each convolutional block in the architecture comprises two convolutional layers followed by batch normalization and ReLU activation functions, with the initial convolutional layer within each block performing the feature extraction. The incorporation of batch normalization aids in stabilizing the training process and expediting convergence, while the ReLU activation function introduces non-linearity, allowing for the capture of complex patterns. Furthermore, since batch normalization is computationally intensive [49] and the proposed CNN architecture is lightweight, using only one such operation per individual convolutional block was sufficient to ensure training stability, as demonstrated in
Section 4. Max-pooling layers are used to reduce both the spatial dimensions of the feature maps and the computational demand. Additionally, a dropout layer with a rate of 0.3 is applied to counteract overfitting, randomly disabling a portion of the neurons during the training phase.
Features are extracted in the network through four convolutional blocks, progressively increasing the number of filters (64, 128, 256, and 512) to capture more abstract and higher-level features at deeper layers. An Adaptive Average Pooling layer follows the final convolutional block, reducing each feature map to a size of 1 × 1, thus converting the spatial dimensions into a single vector representation. This vector is then input into a fully connected neural network comprising two hidden layers, each with 512 neurons. To further mitigate overfitting and boost the network’s ability to generalize, dropout is applied between the fully connected layers. The output layer employs a sigmoid activation function suited for binary classification tasks by producing a probability score for each class.
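A compact PyTorch sketch of this architecture is given below. It should be read as an approximation of the description above rather than the exact implementation: the 3 × 3 kernels, the 2 × 2 pooling windows, the placement of the single batch normalization within each block, and the single sigmoid output neuron are assumptions made for illustration, whereas the filter counts (64–512), the 0.3 dropout rate, the adaptive average pooling, and the two 512-neuron hidden layers follow the text.

# Illustrative PyTorch sketch of the proposed detector; kernel/pooling sizes,
# batch-normalization placement, and the single sigmoid output are assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two convolutional layers with one batch normalization per block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
        nn.Dropout(p=0.3),
    )

class SyntheticImageDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            nn.AdaptiveAvgPool2d(1),     # 512 feature maps -> single vector
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.3),
            nn.Linear(512, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.3),
            nn.Linear(512, 1),
            nn.Sigmoid(),                # probability that the image is AI-generated
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))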
4. Results and Discussion
4.1. Validation Setup
The proposed method was implemented in the Python programming language on the Microsoft Windows 11 operating system, while all experiments were performed on a workstation with an AMD Ryzen 7 3800X CPU and 32 GB of main memory. To facilitate the reproducibility of the experiments, the proposed method relies on the PyTorch machine learning library. In this paper, 70% of the data were used for training the model, 15% for the validation set, and the remaining 15% for the test set. Sample images were selected randomly to ensure the unbiasedness and representativeness of each set.
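As an illustration of this split, assuming the images are organized as a standard image-folder dataset, the partition could be produced as follows; the dataset path, transform, and random seed are placeholders.

# Illustrative 70/15/15 random split; path, transform, and seed are placeholders.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(
    "data/stylegan3_vs_real",
    transform=transforms.ToTensor(),
)
n = len(dataset)
n_train = int(0.70 * n)
n_val = int(0.15 * n)
n_test = n - n_train - n_val
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(0),
)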
In addition to generated satellite imagery, two benchmark datasets were also used to validate the proposed method. These were CIFAKE [
50] and Midjourney v6 [
51]. As presented in
Section 3.1, we created a synthetic dataset of 100,000 images, in which 50,000 images generated with StyleGANv3 were combined with 50,000 of the real images used for training StyleGANv3. CIFAKE, on the other hand, is a publicly available dataset containing 60,000 synthetic and 60,000 real images. The real images come from the CIFAR-10 dataset, while the synthetic images were generated using the Stable Diffusion 1.4 model [
50] and scaled to 32 × 32 pixels. Similarly, Midjourney v6 [
51] consists of a total of 50,000 images: 12,500 are synthetic images generated by the authors, another 12,500 are real images collected from public datasets by the authors and scaled to 64 × 64 pixels, and the remaining 25,000 images were created through image transformations such as vertical and horizontal mirroring and rotation.
The data allocation into training, validation, and test subsets for each dataset, together with image sizes, is presented in
Table 1. All three datasets used in this paper contain an equal number of real and AI-generated images, thus ensuring data balance.
The evaluation of the proposed method on these datasets is based on the confusion matrix [
52], with the following metrics to assess the performance of learned classification models:
precision, defined as $\mathrm{Precision} = \frac{TP}{TP + FP}$;
recall, defined as $\mathrm{Recall} = \frac{TP}{TP + FN}$;
accuracy, defined as $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$; and
F1 score, defined as $F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$,
where TP denotes the number of correctly classified positive examples, TN is the number of correctly classified negative examples, FP denotes the number of incorrectly classified positive examples, and FN is the number of incorrectly classified negative examples.
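For completeness, these metrics can be computed directly from the four confusion-matrix counts, as in the short illustrative sketch below.

# Computing the reported metrics from confusion-matrix counts (illustrative).
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}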
4.2. Performance Evaluation
In order to assess the actual performance and generalization ability, the proposed method was evaluated on a separate test dataset. The test set, comprising 15% of the total dataset and set apart before training, consists of unseen images that serve as a final assessment of the network’s ability to classify unknown examples accurately. The models were trained and tested over 10 training sessions, with the training, validation, and test subsets randomly re-drawn for each session, ensuring that every session used different subsets of images. We set the number of training epochs to ten for the Midjourney v6 and StyleGANv3 datasets to ensure stable convergence without overfitting or underfitting: at the 11th or 12th epoch, the training loss continues to decrease, but the validation loss increases and the overall accuracy on the test set does not improve. The same behavior was observed for CIFAKE after the 15th epoch. For the test sets, we averaged the accuracies from all 10 training sessions and calculated the associated standard deviations. The results are presented in
Table 2. On the other hand, loss and accuracy curves for all three datasets are displayed in
Figure 7,
Figure 8 and
Figure 9 from a training session that yielded the highest accuracy on the test set.
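This evaluation protocol can be summarized by the following sketch, in which train_and_evaluate is a placeholder for one full training run (fresh random split, training, and evaluation on the held-out test set).

# Aggregating test accuracy over 10 independent training sessions;
# `train_and_evaluate` is a placeholder for one full run with a fresh split.
import numpy as np

def evaluate_over_sessions(train_and_evaluate, n_sessions: int = 10):
    accuracies = [train_and_evaluate(seed=s) for s in range(n_sessions)]
    return float(np.mean(accuracies)), float(np.std(accuracies))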
Figure 7 shows the loss versus accuracy curves of the model on the CIFAKE dataset. The losses on the training and validation sets decrease as the number of epochs increases, indicating the successful learning and generalization of the model. The accuracy on both sets increases, reaching high values after a few epochs, while the learning process is relatively stable and has minimal divergence. However, in the first eight epochs, the accuracy on the validation set is higher than that on the training set. This is likely due to the relatively high dropout rate of 0.3 that is active during training [
11]. As training progresses, the accuracy of the training set surpasses that of the validation set. This indicates that the model fits the training data more closely, including any noise or anomalies introduced in the training set. On the testing sets of the CIFAKE dataset, the model achieves an average accuracy of 96.60%, precision of 96.66%, and recall of 96.53%. These high values indicate that the model correctly classifies a large percentage of images, with very few false positive and false negative examples. The average precision of 96.66% indicates the high ability of the model to detect AI-generated images as fake correctly, while the recall of 96.53% shows the model’s capacity to identify 96.53% of instances of fake images, ensuring that only 3.47% of fakes are undetected. A high F1 score of 96.59% confirms the balance between precision and recall.
Similar results can be seen for the Midjourney v6 dataset. In the first six epochs, the accuracy on the validation set is higher than that on the training set, while, as training progresses, the accuracy of the training set surpasses that of the validation set. The loss and accuracy curves for the model on the Midjourney v6 dataset display a significant decrease in loss on the training and validation sets during the first few epochs, after which a slight convergence occurs (see
Figure 8). In contrast, the model’s accuracy on the training and validation data shows a sharp increase in the initial epochs, after which it stabilizes and continues to increase at a lower rate. The model achieves very high average results on the test sets, with an accuracy of 99.94%, a precision of 99.63%, a recall of 99.79%, and an F1 measure of 99.20%. A recall close to 100% suggests that the model identifies nearly all AI-generated images.
Interestingly, on the StyleGANv3 dataset, the validation set achieves better performance than the training set, as shown in
Figure 9. This phenomenon can be attributed to the application of regularization techniques such as dropout layers in the network architecture. Dropout layers randomly turn off a certain percentage of neurons during the training process, thereby reducing the possibility of overfitting the model to the training data and improving the ability to generalize to unseen examples [
11]. When we train it for another five epochs, the accuracy of the training set surpasses that of the validation set, which starts to decrease, indicating that 10 epochs is sufficient for optimal training. Moreover, the average results achieved on the test sets were 98.69% accuracy, 98.31% precision, 99.11% recall, and an F1 measure of 98.70%. Both precision and recall are very high and balanced, indicating that the model makes very few false positive (FP) and false negative (FN) errors. In other words, when the model predicts that an image is AI-generated, it is correct in 98.31% of the samples.
As we generated the StyleGANv3 dataset ourselves and results for Midjourney v6 were not yet available in published research, since it was only released at the end of the previous year, we compared the proposed method on the CIFAKE dataset with five state-of-the-art methods. The comparison includes CIFAKE CNN [50] and the Dual-Input Neural Model (DINM) [32], while the results for ResNet [15], VGGNet [13], and DenseNet [14] are obtained from [53].
Table 3 summarizes the results.
DenseNet attained the highest accuracy of 97.74%, emerging as the most effective model among those evaluated, while the proposed method followed closely at 97.32%. VGGNet also demonstrated strong performance with an accuracy of 96.00%. In comparison, CIFAKE CNN, DINM, and ResNet reported somewhat lower accuracies of 94.80%, 94.51%, and 94.95%, respectively. In summary, both DenseNet and the proposed method surpassed the other evaluated approaches on the CIFAKE dataset, suggesting that the proposed method holds significant promise for future research and applications within this domain.
4.3. Explaining the Decisions of the Proposed Method
Finally, we discuss the explanations of the proposed method decisions using Grad-CAM++ (v.1.2.4), the approach described in [
42]. Explanations were provided for all three datasets used in this research, while some samples of real and fake images used for explanation were randomly selected. As discussed in
Section 2.3, Grad-CAM++ creates a heatmap to pinpoint the importance of different image regions that influence the decision-making process, where warmer colors indicate greater importance for a specific class and cooler colors indicate lower importance. Moreover,
Figure 10,
Figure 11 and
Figure 12 show Grad-CAM++ activations for each of the eight convolutional layers of the proposed network architecture. In those figures, examples (a) depict activations for real images, while examples (b) show synthetic images. Each row starts with the original image without the heat map, followed by subsequent images showing the heat maps of the activations for a specific convolutional network layer. These visualizations are from the last epoch of model training and illustrate the model’s ability to distinguish between real and synthetic images.
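For readers who wish to reproduce such visualizations, a minimal sketch using the pytorch-grad-cam package is given below. The exact constructor arguments differ between package releases, so the sketch should be adapted to the installed version; `model` and `input_tensor` are placeholders for the trained detector (e.g., the architecture sketched in Section 3.3) and a preprocessed image batch.

# Illustrative Grad-CAM++ heatmap generation with the pytorch-grad-cam package;
# the API varies between versions, and `model` / `input_tensor` are placeholders
# for the trained detector and a preprocessed batch of shape (1, 3, H, W).
import numpy as np
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

target_layers = [model.features[-2]]             # e.g., the last convolutional block
cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor)[0]      # (H, W) importance map in [0, 1]
rgb = input_tensor[0].permute(1, 2, 0).cpu().numpy()
overlay = show_cam_on_image(np.clip(rgb, 0.0, 1.0), heatmap, use_rgb=True)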
As can be seen from
Figure 10, the main objects in real images predominantly contribute to the recognition of the real images, while activations for synthetic images are more diffused and usually do not focus on the image object itself but on small visual irregularities in the background. This is also aligned with the previous research, for example, in [
50]. We observe similar patterns for the Grad-CAM++ activations in
Figure 11 and
Figure 12. In the initial convolutional layers, the activations are small and spread across the image, indicating that the network initially detects general visual features in different parts of the image. However, as we progress through the network layers, the activations become more focused and concentrated in specific areas of the real image. The observation suggests a distinctive difference in how activations behave in real versus synthetic images. In real images, activations focus on the main object, presumably due to their clear, central role in the composition. Conversely, synthetic images show activations clustering not around the primary object but around the peripheries, including object edges, backgrounds, or areas with complex textures. This indicates a fundamental divergence in how real and synthetic images are processed or represented, possibly due to the inherent differences in their composition and the elements that are more pronounced or defining in each type.
Generally, we notice that, when the network converges to the optimal solution, the heatmaps of activations provide more useful insight into how the neural network learns through different layers and training epochs. Convergence is manifested in the reduction of the magnitudes of gradients and activations in later epochs, indicating that the network has successfully learned the essential features and patterns from the training data. Activations become smaller and more precise, and those that contribute the most to the classification are usually found in the last layers of the network, where very abstract and complex information is extracted.
The differences in activation patterns between real and synthetic images can serve as a guide for the next generation of generative model architectures and advanced training techniques. Furthermore, in the face of the growing prevalence of deepfakes and AI-generated media, these findings can play a vital role in the development of tools to counter misinformation. An in-depth comprehension of the variations in activation patterns in AI-synthesized images can give platforms and regulators a substantial ability to identify and effectively mitigate the dissemination of fake content.
5. Conclusions
Our research demonstrates a substantial advancement in the application of deep learning, especially CNN architectures, to address the complex challenges of digital image recognition and generation. By creating a high-performance CNN-based method, we not only improve the human ability to identify AI-generated images with greater accuracy but also, through explanations, move closer to understanding the subtle processes of synthetic image recognition, thus bolstering the reliability and accuracy of digital data. The proposed method holds great promise for future research and applications in this field, even though its accuracy is slightly below that of the best-performing method. Finally, the explanations of the models indicate that activations in real images are often focused on the main object, while, in synthetic images, activations tend to be clustered around the edges of objects, in the background, or in areas with complex textures.