This section describes all experiments that were conducted and the results obtained on two different datasets. The first dataset is the manually collected and evaluated MANFA dataset [13], and the second is the PGGAN dataset [7]. All experiments were run on an NVIDIA DIGITS workstation with Ubuntu 16.04 pre-installed, containing an Intel® Core i7-5930K processor, four Titan X 12 GB GPUs (3072 CUDA cores each), and 64 GB of DDR4 RAM.
Section 4.1 explains the evaluation metrics used in this research, including the AUC score, precision, and recall. Then, the first experiment validates the performance of the proposed model on a balanced dataset, as shown in Section 4.2. A visualization of the detected tampered regions is presented in Section 4.3 to explain why the model classifies an input image as manipulated. After that, the performance of three different approaches applied to the proposed model for solving the imbalanced dataset problem (IDP) is described in Section 4.4. Finally, Section 4.5 shows the performance of the proposed model on a GAN dataset.
4.1. Evaluation Metrics
The prediction output of the system for an input image is either real or tampered, so this is a binary classification problem. The performance of the system is usually represented in a confusion matrix, which is given in Table 2.
After the confusion matrix is constructed, accuracy, precision, and recall are computed to investigate the performance of the proposed model. Accuracy refers to the proportion of correctly classified samples (TP and TN) among the total samples in the test dataset. Because accuracy alone cannot provide a thorough evaluation of the model, precision and recall are two widely used additional measurements:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FN, FP, and TN are the true-positive, false-negative, false-positive, and true-negative counts for one class against the other class, taken from the confusion matrix.
Based on the values acquired from the confusion matrix, the true-positive rate (TPR) and false-positive rate (FPR) are computed. The TPR is the same as recall, whereas the FPR is given by the following equation:

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
A receiver operating characteristic (ROC) curve [26] plots the TPR against the FPR at different cut-off points. Moreover, in multi-class classification, every false prediction counts as an FP for the predicted class and as an FN for the true class. Each point on the curve depicts a sensitivity/specificity pair that corresponds to a particular decision threshold. The area under the ROC curve (AUC) [26] is usually applied to estimate the performance of the proposed classification model. If the ROC curve of classifier C1 has a higher AUC value than that of classifier C2, then C1 is considered to perform better than C2.
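For illustration, the ROC curve and AUC can be computed with scikit-learn; the following is a minimal sketch (scikit-learn is not named in the paper, and the variable names are placeholders):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: ground-truth labels (0 = real, 1 = tampered);
# y_score: the model's predicted probability for the tampered class.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The AUC summarizes the whole ROC curve in a single value in [0, 1].
auc = roc_auc_score(y_true, y_score)
```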
4.2. Balanced Dataset Experiment
The initial experiment examines the performance of the TFID model on the balanced MANFA dataset (Dataset 1) for the tampered face image identification task. The training dataset contains 4200 tampered images and 4200 real images, randomly taken from the original MANFA dataset. Four-fold cross-validation is then implemented on the extracted dataset by dividing it into four subsets of 2100 images each. The numbers of tampered and real images are shown in Table 3. For each fold, three subsets are used for training, and the remaining subset is used for testing. Within the training dataset, 80% of the data are used to train the proposed model, and the remaining images serve as a validation dataset to validate the trained model.
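The splitting procedure can be sketched as follows, assuming scikit-learn is used for fold generation (an assumption; the paper does not name the tooling, and X and y are placeholders for the 8400 images and their labels):

```python
from sklearn.model_selection import KFold, train_test_split

kfold = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    # One subset of 2100 images is held out for testing in each fold.
    X_trainval, X_test = X[train_idx], X[test_idx]
    y_trainval, y_test = y[train_idx], y[test_idx]
    # 80% of the training fold trains the model; 20% validates it.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.2, stratify=y_trainval)
```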
Face regions are first localized and extracted with a Python implementation of the facial landmark algorithm proposed by [22]. This study applies a 5-point facial landmark detector (2 points for the left eye, 2 points for the right eye, and 1 point for the nose) because it has been shown to be 8–10% faster than the original 68-point detector [27]. The localized faces are then rotated and aligned to a frontal pose to remove pose variations. The facial landmark algorithm is implemented with the dlib library, version 19.18.0. Next, the OpenCV library, version 4.1.1, is used to perform the face rotation and to resize all images to 256 × 256. The Python programming language and the Keras neural-network library are used to implement the proposed models. The proposed TFID is optimized with Adam, with the learning rate initially set to 0.001 as recommended by [28] for models with a small number of convolutional layers. The batch size is set to 32 because a smaller batch size can increase test accuracy with the Adam optimizer [29]. The model is trained for 50 epochs. The validation accuracy, validation loss, training accuracy, and training loss for each fold are provided in Figure 7.
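A compact sketch of the preprocessing step is given below. It uses dlib's bundled 5-point model and its get_face_chip helper to align and crop in a single call; the paper instead performs the rotation and resizing with OpenCV, so this is an approximation of the pipeline rather than the authors' exact code:

```python
import cv2
import dlib

# The 5-point landmark model ships with dlib; the file path is an assumption.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

img = cv2.imread("input.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

for rect in detector(rgb, 1):
    landmarks = predictor(rgb, rect)   # 2 points per eye + 1 nose point
    # Rotate/align the face to frontal and crop it to 256 x 256.
    aligned = dlib.get_face_chip(rgb, landmarks, size=256)
```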
The training and validation accuracies increase dramatically to over 79%, whereas the training and validation losses decline significantly to 32% after the 7th epoch. During the remaining epochs, the training and validation accuracies rise steadily and reach a peak of 83%. The most robust results with respect to validation accuracy and validation loss are observed in fold 3, whereas the other folds fluctuate on both measures.
The proposed model is also compared with the pre-trained SE-ResNet-50 and VGG16-based VGGFace models, which have achieved state-of-the-art performance on the VGGFace2 dataset [3]. These two models were selected because they are trained on huge datasets of human faces; the learned facial features therefore help the pre-trained models converge faster on the MANFA dataset, which also consists of human faces. We set the hyper-parameters as suggested by [3,30] for the pre-trained VGG16 and SE-ResNet-50 models to enable a good trade-off between bias and variance. A performance comparison between the TFID, VGG16, and SE-ResNet-50 models is shown in Table 4.
In general, all three models performed well on the balanced MANFA dataset. The VGG16 model achieved an accuracy of 81%, a precision of 78%, a recall of 84%, and an AUC value of 0.83, while the TFID model obtained a higher accuracy of 83%, a precision of 81%, a recall of 89%, and an AUC value of 0.86. The pre-trained SE-ResNet-50 model achieved the highest classification performance, with an accuracy of 84.7%, a precision of 82%, a recall of 91%, and an AUC value of 0.89. The classification performance of the proposed model is thus comparable to that of the state-of-the-art VGG16 and SE-ResNet-50 models, so the TFID model has the potential to handle the tampered face image identification task. Based on the results in Table 4, the TFID and SE-ResNet-50 models are used in the next experiment because they performed better than the VGG16 model.
4.4. Imbalanced Dataset Experiment
In this section, an experiment is conducted to evaluate the performance of three different extensions of the TFID model for dealing with the IDP: XGBoost (the ensemble-based approach), class weighting (the cost-sensitive learning approach), and data augmentation (the data-based approach).
Proportions of tampered images to real images ranging from 1/1 (balanced dataset) to 1/100 (highly imbalanced dataset) are applied to the MANFA dataset. A total of 2000 tampered images and 200,000 real images are chosen from the MANFA dataset. Table 5 depicts the numbers of real and tampered face images for each imbalanced case.
For the ensemble-based approach, the 9216-dimensional feature vector produced by the flatten layer of TFID is extracted, whereas a 2048-dimensional feature vector is extracted from the SE-ResNet-50 model. An XGBoost classifier is then trained on these extracted features, with the learning rate set to 0.1, the number of trees to fit set to 100, and the maximum tree depth for base learners set to 3.
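A sketch of this feature-extraction-plus-XGBoost pipeline is shown below. The layer name "flatten" and the variables tfid, X_train, X_test, and y_train are assumptions about how the model and data are defined; the XGBoost hyper-parameters are those stated in the text:

```python
from keras.models import Model
from xgboost import XGBClassifier

# Cut the trained TFID network at its flatten layer to obtain features.
extractor = Model(inputs=tfid.input, outputs=tfid.get_layer("flatten").output)
train_feats = extractor.predict(X_train)   # shape: (n_samples, 9216)
test_feats = extractor.predict(X_test)

# Hyper-parameters as reported: learning rate 0.1, 100 trees, depth 3.
xgb = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3)
xgb.fit(train_feats, y_train)
y_pred = xgb.predict(test_feats)
```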
For the cost-sensitive learning-based approach, every sample from the tampered class is treated as n instances of the real class, forcing the classifier to weight the tampered class and the real class equally. This is implemented with the class_weight parameter of the Keras library, which assigns a higher loss to the tampered class so that the classifier focuses more on its samples. The class_weight value for each class is set according to the proportion of tampered images to real images. For example, when the proportion is 1/100, class_weight is set to 100 for the tampered class and to 1 for the real class, forcing the model to treat every instance of the tampered class as 100 instances of the real class. When the proportion is 1/1, which indicates a balanced dataset, class_weight is fixed to 1 for both classes.
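In Keras this amounts to a single fit argument. The sketch below assumes the 1/100 setting, with class 0 = real and class 1 = tampered, and a compiled network named model:

```python
# Each tampered sample contributes to the loss as much as 100 real samples.
class_weight = {0: 1, 1: 100}

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=32, epochs=50,
          class_weight=class_weight)
```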
For the data-based approach, data augmentation transformations, including horizontal flips, horizontal and vertical shifts, brightness changes, zooming, noise addition, and random rotations within 10 degrees, are implemented. The augmentation is then combined with the Python imbalanced-learn library to create a balanced batch generator, which ensures that the number of samples per class in each batch always follows a balanced distribution, as sketched below.
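The following sketch combines a Keras ImageDataGenerator with imbalanced-learn's BalancedBatchGenerator. The augmentation ranges, the choice of RandomUnderSampler, and the noise level are assumptions; the paper only lists the transform types:

```python
import numpy as np
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler
from keras.preprocessing.image import ImageDataGenerator

# The transforms listed above; the exact ranges are assumed.
augmenter = ImageDataGenerator(
    horizontal_flip=True,
    width_shift_range=0.1, height_shift_range=0.1,
    brightness_range=(0.8, 1.2),
    zoom_range=0.1,
    rotation_range=10)

def augment(batch):
    # ImageDataGenerator has no built-in noise option, so Gaussian noise
    # is added after the random geometric/photometric transforms.
    out = np.stack([augmenter.random_transform(img) for img in batch])
    return out + np.random.normal(0.0, 2.0, out.shape)

# Every batch drawn from the generator has a balanced class distribution.
n = len(X_train)
gen = BalancedBatchGenerator(X_train.reshape(n, -1), y_train,
                             sampler=RandomUnderSampler(),
                             batch_size=32, random_state=42)

for i in range(len(gen)):                        # one epoch
    X_batch, y_batch = gen[i]
    X_batch = augment(X_batch.reshape(-1, 256, 256, 3))
    model.train_on_batch(X_batch, y_batch)       # `model` is the TFID network
```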
Finally, eight models are implemented: TFID, the data-based TFID classifier (TFID-SA), the cost-sensitive learning-based TFID classifier (TFID-CW), the ensemble-based TFID classifier (TFID-XGB), SE-ResNet-50, the cost-sensitive learning-based SE-ResNet-50 classifier (SE-ResNet-50-CW), the data-based SE-ResNet-50 classifier (SE-ResNet-50-SA), and the ensemble-based SE-ResNet-50 classifier (SE-ResNet-50-XGB). The performance of each model under the various imbalanced dataset settings is shown in Figure 9.
When the proportion of tampered images to real images is 1/1 (balanced dataset), all models achieve an AUC value of over 0.8, and the TFID-XGB and SE-ResNet-50-XGB models reach slightly higher performance than TFID and SE-ResNet-50. The AUC values become lower as the number of real images relative to tampered images grows, because the TFID and SE-ResNet-50 models focus on features from the majority class and overlook features from the minority class.
The effect of the IDP can be observed under the extreme setup, where the proportion of tampered images to real images is 1/100. The AUC values of the TFID and SE-ResNet-50 models plummeted to 0.59 and 0.6, respectively. In contrast, the results obtained from the ensemble-based models (XGB) were more robust, remaining over 0.8 and higher than those of the other extensions of TFID. Of the two ensemble-based models, TFID-XGB achieved an AUC value of 0.92 and SE-ResNet-50-XGB an AUC value of 0.88 at the highly imbalanced ratio of 1/100. The cost-sensitive learning approach, comprising the TFID-CW and SE-ResNet-50-CW models, also maintained high AUC values between 0.76 and 0.88. We noticed that the performance of the data-based hybrid models (TFID-SA and SE-ResNet-50-SA) decreased gradually as the number of real images increased; they reached their lowest AUC values of 0.64 and 0.71, respectively, at the 1/100 proportion. The main reason for the poor performance of the data-based approach is that the images generated by the augmentation technique are merely extensions of the original images, which can lead to overfitting [31,32].
After calculating the AUC values, macro-precision, macro-recall, and macro-f1 are computed. These measurements are typically used to evaluate the performance of a system across different datasets, and because they weight each class equally, they are largely insensitive to the IDP. Macro-precision and macro-recall are computed by averaging the precision and recall values that a classifier obtains for each class.
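With scikit-learn (an assumption; the paper does not name the tooling), the macro measurements for one imbalance setting reduce to a single call:

```python
from sklearn.metrics import precision_recall_fscore_support

# y_true / y_pred: labels and predictions under one imbalance setting.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
```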
Table 6 shows the computed macro-precision, macro-recall, and macro-f1 of the eight models under the different imbalanced dataset settings. The highest macro-f1 value belongs to the SE-ResNet-50-XGB model, whereas the proposed TFID-XGB model achieves a macro-f1 value of 0.887. The obtained results confirm that the ensemble-based approach using XGBoost is the most effective way to deal with the IDP.
In the previous experiment, the ensemble-based extensions of the TFID and SE-ResNet-50 models outperformed the other approaches, with AUC values that always remained over 0.8, even in the most imbalanced scenario. A further experiment is conducted to compare the models in terms of computational complexity, i.e., to determine which models require the least and the most testing time. The testing time per image of the eight models (TFID, TFID-CW, TFID-XGB, TFID-SA, SE-ResNet-50, SE-ResNet-50-CW, SE-ResNet-50-XGB, and SE-ResNet-50-SA) on the MANFA dataset, with the proportion of tampered images to real images set to 1/10, is shown in Figure 10.

As shown in Figure 10, the testing time per image of the SE-ResNet-50, SE-ResNet-50-SA, and SE-ResNet-50-CW models is about 3 s. SE-ResNet-50-XGB requires 4.2 s per image, the longest time among the SE-ResNet-50 extensions. In contrast, the TFID and TFID-CW models have the shortest testing time per image (about 0.8 s), while the TFID-XGB model takes longer, at 1.5 s. The ensemble-based models require more computing power because the tampered features must first be extracted from the TFID or SE-ResNet-50 model and then fed into the XGBoost classifier for classification. However, this is a fair trade-off because the model performance increases significantly.