1. Introduction
Image inpainting, the process of reconstructing lost or deteriorated parts of an image, is essential in fields such as image editing and restoration. Although inpainting methods have advanced significantly, as can be observed in the thorough reviews in [
1] and [
2], detecting the presence of inpainting remains a critical challenge [
3]. The increasing sophistication of these techniques necessitates reliable tools that can identify even subtle modifications. The proliferation of image manipulation has spurred significant growth in multimedia forensics and related disciplines. Consequently, a plethora of methods and tools have emerged for detecting and locating image forgeries. As can be seen from systematic reviews of forgery detection papers [
4], while this field has expanded rapidly, current research primarily concentrates on the identification of deepfakes [
5]. Although substantial advances have been achieved in detecting image copy-move or splicing, current state-of-the-art methods for inpainting detection remain insufficient for practical use in real-world situations. This deficiency is due to several critical challenges that are still being tackled:
Limited generalizability across various image datasets
Limited realism of the forged samples
The absence of a comprehensive image dataset that encompasses multiple inpainting techniques
Limited coverage of image variations
Insufficient detection accuracy
The fast-paced evolution of inpainting technologies
To address the above limitations, this paper proposes a novel inpainting detection framework that utilizes a wavelet scatternet based on the Dual-Tree Complex Wavelet Transform in conjunction with convolutional neural networks (CNNs). We propose a method that combines wavelet scattering with a UNet++ architecture and an EfficientNetV2-like encoder. Additionally, to enhance the Intersection over Union (IoU) metric, the proposed method incorporates a fusion module at the end of the neural network (see
Figure 1), which is based on texture segmentation and noise level analysis applied to the mask suggested by the neural network. Additionally, we propose a new dataset called the Real Inpainting Detection Dataset, designed to address the existing limitations in current datasets. In this dataset, the inpainted objects (removed objects) are based on real object masks sourced from the Google Open Images dataset, rather than randomly generated masks that could potentially overlap with multiple objects in an image. The dataset incorporates several distinct inpainting methods, each applied to the same images, and includes a variety of backgrounds and architectural approaches (such as GANs, diffusion models, and transformers) to facilitate a more accurate validation of the detection method.
2. Related Work
Image forensics is a discipline dedicated to detecting and analyzing digital image manipulation. It involves techniques and algorithms to determine images' authenticity, origin, and integrity. The primary goal is to expose forgeries, identify image sources, and provide evidence in legal investigations. Based on the classification proposed by the authors of [
6,
7], image forgery detection can be classified according to the traces it exploits: either operation-specific traces (copy-move, splicing, inpainting) or optical camera traces (blur, noise, chromatic aberration, etc.). From a detection standpoint, inpainting forgery detection has not been extensively examined. It has traditionally been regarded as a subset of copy-move forgery detection. Copy-move forgery detection (CMFD) is a well-established digital image forensic technique known for its capability to detect altered regions in multimedia content. Researchers have developed numerous CMFD algorithms, drawing either on traditional digital image processing (DIP) and feature-based approaches or on deep learning techniques. In the following subsection, we review CMFD methods applicable to image inpainting, as well as a few inpainting-specific methods.
In block-based methods, the image is divided into non-overlapping, overlapping, or partially overlapping blocks of equal size. Feature extraction is then performed on each block, and lastly the features are sorted lexicographically. For feature extraction, several methods have been employed, such as the DCT [
8] or DWT [
9]. The method based on the DWT applies image decomposition and then further processes only the LL subband, thus ignoring high-frequency details. Some authors have even tried to combine these techniques [
10]. All the above methods have their strengths, but their main disadvantage lies either in the inability to handle cases such as resizing and rotation of the copied object or in block-related design decisions, such as block size, overlapping versus non-overlapping partitioning, dimensionality reduction, etc. To overcome these limitations, authors have suggested approaches based on key-point extraction, such as SURF [
11] or SIFT [
12]. From an analysis performed in [
13] and a well-crafted dataset, the method with the best results is based on Zernike moments [
14]. In general, so-called key-point methods perform better than block-based ones, but they cannot handle cases in which real objects with the same texture naturally repeat within the image [
15]. For older inpainting patch-based methods like the one proposed by Criminisi [
16], some of the above methods yield decent results, but they are unable to handle inpainting produced by newer patch-based and deep learning methods.
In terms of newer, neural networks-based forgery detection methods, one of the most cited papers is [
17]. The authors’ approach is a universal forgery detection mechanism, which combines a feature extractor with an anomaly detection module. The feature extractor learns several types of patterns introduced by forgery processing operations (such as blur inconsistencies, color/texture inconsistencies, etc.). The problem with their approach is that newer methods, such as diffusion models or transformers, introduce markedly different artifacts; it is therefore always necessary to retrain the feature extractor on newer methods. Specifically, for inpainting detection, in [
18], the authors proposed a neural network called IID-Net, which is similar to ManTra-Net. IID-Net employs Neural Architecture Search (NAS) to automatically discover the most effective network architecture for inpainting detection. This approach allows IID-Net to find an optimal configuration tailored to the specific task of inpainting detection. Additionally, IID-Net incorporates an attention mechanism that helps the network focus on the regions most likely to be manipulated, enhancing its ability to detect subtle inpainting forgeries. To train on diverse types of inpainting, the authors of IID-Net also propose a new dataset built with several inpainting methods. The problem with their dataset is that the mask to be removed (filled in) by the inpainting methods is created artificially and arbitrarily, forcing the inpainting methods to generate many visible artifacts and therefore improving the chances of the detection method. An improvement to the IID method was proposed in [
19], called AFTLNet. The network uses an adaptive learning framework to efficiently learn and detect the traces left by inpainting operations, which are often subtle and difficult to identify. AFTLNet’s performance is highly dependent on the quality and diversity of the training data. Their proposed dataset uses small images of 256 × 256, and again, there is a problem with the generated mask of the object to be inpainted. In recent years, several improvements have been made by looking at the problem differently and relying only on a specific type of artifact—noise [
20]. Noisesniffer operates by analyzing the noise variance across different regions of an image. It assumes that authentic images have a consistent noise pattern, while manipulated regions will exhibit anomalous noise characteristics. The tool extracts noise features from the image and uses them to create a noise map, which highlights potential tampered areas. Noisesniffer relies heavily on the assumption that authentic images have consistent noise patterns. However, this assumption may not hold for all images, especially those captured under varying conditions or with different devices, which can introduce natural noise inconsistencies. The method can be less effective on images that have undergone heavy compression. Compression artifacts can interfere with the noise extraction process, leading to false positives or false negatives in forgery detection [
21]. Recently, Ref. [
22] aims to enhance deepfake detection in low-quality, highly compressed content by leveraging a three-branch architecture: a Local High-Frequency Enhancement (LHiFE) branch, a Global High-Frequency Enhancement (GHiFE) branch, and a regular branch. The regular branch uses the standard RGB image as input, the LHiFE branch uses a Discrete Cosine Transform to enhance local high-frequency information, and the GHiFE branch employs a multi-level wavelet decomposition to extract global high-frequency information. Additionally, a Two-Stage Cross-Fusion module is designed to integrate information from these branches effectively, enhancing weak high-frequency information in low-quality data. Although the method exhibits good results on the analyzed datasets, it is significantly affected by the cross-dataset generalization problem and by the extension to object removal (inpainting). The limitation with respect to inpainting stems from the nature of the inpainting process, which fills regions using information already present in the image, so that the color and texture of the filled area exhibit the same correlations as the rest of the image. Other limitations include vulnerability to adversarial attacks and the fact that the method does not exploit temporal inconsistencies; rather, it focuses only on spatial information. Lastly, the method is applied to rather small images (256 × 256), which impacts object removal for real-world images: in smaller images, edges and transitions may appear less sharp due to limited pixel information, producing blur artifacts, texture loss, and a loss of high-frequency information. Like the previous study, the work in [
23] proposes a similar approach. However, the key differences lie in its use of a combined FFT and Wavelet method for feature extraction and the integration of a 3D CNN with LSTM to capture spatial and temporal consistencies in deepfake detection.
To address all the above-mentioned limitations, this paper proposes a novel approach that incorporates scatternets, first proposed by Mallat in [
24] and the Dual-tree complex wavelet first proposed by Kingsbury in [
25]. Wavelet transforms have been widely used in image-processing tasks such as compression, denoising, and feature extraction. Wavelet scattering, proposed by Mallat, provides a robust representation of signals and images that is stable to deformations, such as translations and small rotations, making it particularly useful for tasks like classification and pattern recognition. The Dual-Tree Complex Wavelet Transform (DT-CWT), introduced by Kingsbury, offers advantages over the traditional Discrete Wavelet Transform (DWT) by providing nearly shift-invariant and directionally selective filters. These properties make the DT-CWT particularly useful in detecting fine details and preserving edge information, which are crucial in inpainting detection. To date, only two works have applied these transforms to forgery detection: the scatternet was used in [26], in which the authors tried to use the resulting artifacts for face forgery detection, and the DT-CWT was used in [27], which follows the same block-based approach but relies on the DTCWT for feature extraction.
3. Proposed Method
Our proposed method uses a Scatternet encoder/decoder architecture to learn high-level feature inconsistencies and generate an initial mask. To enhance the detection process, an additional module is incorporated, which takes the original image and applies color segmentation. For each segmented region, the module determines the extent of overlap with the mask produced by the scatternet. Segments that either do not overlap or fully overlap with the mask are disregarded. However, if a segment partially overlaps with the mask, a noise distribution is calculated for that segment. Only irregularities in the noise distribution at the block level within these partially overlapping segments are marked as potential forgeries. A detailed procedure is presented in
Figure 2.
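To make the procedure of Figure 2 concrete, the following minimal Python sketch mirrors the fusion logic described above. The helpers scatternet_mask, hfs_segments, and patch_noise_levels are placeholders for the components detailed in Sections 3.1 and 3.2, and the deviation threshold is an illustrative assumption rather than a value taken from the paper.

```python
import numpy as np

def detect_inpainting(image, scatternet_mask, hfs_segments, patch_noise_levels,
                      noise_std_threshold=2.0):
    """Fuse the neural-network mask with texture/noise analysis (sketch).

    scatternet_mask    : callable, image -> binary mask proposed by the network
    hfs_segments       : callable, image -> list of boolean region masks (color/texture)
    patch_noise_levels : callable, (image, region) -> (patch masks, per-patch noise levels)
    """
    nn_mask = scatternet_mask(image)                 # initial mask (Section 3.1)
    final_mask = np.zeros_like(nn_mask, dtype=bool)

    for region in hfs_segments(image):               # color/texture segmentation
        overlap = np.logical_and(region, nn_mask).sum() / max(region.sum(), 1)
        if overlap == 0.0 or overlap == 1.0:
            continue                                  # no overlap or full overlap: ignore
        # Partially overlapping region: inspect its per-patch noise distribution.
        patches, noise = patch_noise_levels(image, region)
        z = (noise - noise.mean()) / (noise.std() + 1e-8)
        for patch_mask, score in zip(patches, z):
            if abs(score) > noise_std_threshold and np.logical_and(patch_mask, nn_mask).any():
                final_mask |= patch_mask              # confirmed inconsistency
    return final_mask
```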
3.1. Enhanced Wavelet Scattering Network
Our inspiration is drawn from recent studies in the use of complex wavelets in neural networks [
28]. The first novelty of the proposed method for image inpainting detection is the use of scatternets for feature extraction—more specifically, an enhanced variant of the learnable scattering layer proposed in [
28]. Wavelet Scatternet [
24], a powerful tool for feature extraction in image forgery detection, operates on the principles of wavelet transformation and scattering. The method involves cascading wavelet transformations and modulus non-linearities to capture local frequency content through wavelet coefficients, providing a comprehensive representation of image features. This approach is particularly effective in capturing information at various scales in an image, from global components to finer details and high-frequency components. The mathematical foundation of the wavelet scatternet involves the following main operations: the wavelet transform, a convolution with a set of wavelet filters at different scales and orientations; the modulus operation, which captures the amplitude of the signal, making the representation non-linear and thus more robust to small variations in the input; averaging (smoothing), which reduces the spatial resolution while maintaining essential information, enhancing the stability of the features; a non-linearity applied to enhance discriminative information; and, as a last step, pooling, in which features are pooled to reduce dimensionality and improve generalization. For an input image $x$, the wavelet transform at scale $j$ and orientation $\theta$ is given by
$$W_{j,\theta}\,x = x \ast \psi_{j,\theta},$$
where $\psi_{j,\theta}$ is the wavelet filter. The modulus and averaging operations are then applied to form the scatternet representation:
$$S\,x = \left| x \ast \psi_{j,\theta} \right| \ast \phi_J,$$
where $\phi_J$ is a low-pass filter. After applying the wavelet transform, the modulus operation is performed to introduce non-linearity:
$$U\,x = \left| x \ast \psi_{j,\theta} \right|.$$
This operation ensures that the magnitude of the wavelet coefficients is considered, which helps emphasize texture discontinuities and anomalies that may arise from forgery. Non-linearity plays a crucial role in capturing complex interactions between different frequency components. To achieve translation invariance, the modulus of the wavelet coefficients is smoothed by applying a low-pass filter, typically a Gaussian filter:
$$S_1\,x = \left| x \ast \psi_{j,\theta} \right| \ast \phi_J,$$
where $\phi_J$ is the low-pass filter. This step ensures that small translations in the image do not significantly alter the feature representation, providing robustness to misaligned forged regions. The averaging step provides translation invariance, ensuring that small shifts or displacements in forged regions do not result in substantial changes in the scatternet representation. Mathematically, for small translations $\tau$, we have (this robustness is crucial in cases where forged regions may be slightly misaligned):
$$S\,x(u - \tau) \approx S\,x(u).$$
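As a didactic illustration of the first-order coefficient $S_1\,x = |x \ast \psi_{j,\theta}| \ast \phi_J$, the snippet below computes one scattering channel using a Gabor (Morlet-like) filter from scikit-image and a Gaussian low-pass filter from SciPy. It is a sketch for intuition only, not the implementation used in the paper; the frequency, orientation, and smoothing scale are illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import data
from skimage.filters import gabor

image = data.camera().astype(np.float64) / 255.0   # any grayscale test image

# Wavelet transform: convolution with a complex (Gabor/Morlet-like) filter
# at one scale (frequency) and one orientation theta.
real, imag = gabor(image, frequency=0.25, theta=np.pi / 4)

# Modulus non-linearity: keep the magnitude of the complex response.
modulus = np.hypot(real, imag)

# Averaging with a low-pass (Gaussian) filter phi_J for local translation invariance.
s1 = gaussian_filter(modulus, sigma=4)

print(s1.shape, s1.mean())   # one first-order scattering channel
```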
A special type of wavelet scattering was proposed by the authors in [
29]. The Dual-Tree Complex Wavelet Transform (DTCWT) extends the traditional Discrete Wavelet Transform (DWT) by using complex-valued wavelets. It involves two parallel wavelet trees (real and imaginary) that form a complex representation. The complex wavelet coefficients are formed as
$$W_{j,\theta}\,x = W^{r}_{j,\theta}\,x + i\,W^{i}_{j,\theta}\,x,$$
where $W^{r}_{j,\theta}$ and $W^{i}_{j,\theta}$ denote the real and imaginary wavelet trees, respectively. The magnitude of these coefficients is used to capture texture features:
$$\left| W_{j,\theta}\,x \right| = \sqrt{\big(W^{r}_{j,\theta}\,x\big)^{2} + \big(W^{i}_{j,\theta}\,x\big)^{2}}.$$
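A minimal example of obtaining the six oriented complex subbands and their magnitudes, assuming the open-source dtcwt Python package; this is used here only for illustration and may differ from the implementation employed in the paper.

```python
import numpy as np
import dtcwt

image = np.random.rand(256, 256)                # placeholder grayscale image

transform = dtcwt.Transform2d()
pyramid = transform.forward(image, nlevels=1)   # one-level decomposition

# pyramid.highpasses[0] holds the six oriented complex subbands
# (orientations of roughly +/-15, +/-45 and +/-75 degrees), each at half resolution.
bands = pyramid.highpasses[0]                   # shape: (H/2, W/2, 6), complex valued
magnitudes = np.abs(bands)                      # texture features used for the analysis

print(bands.shape, magnitudes.mean(axis=(0, 1)))
```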
The main benefits of using the DTCWT for capturing texture patterns can be summarized as follows: shift invariance (it is approximately shift-invariant, making it robust for texture recognition), reduced redundancy (a more efficient representation compared to other complex wavelet transforms), and speed, as the results presented later in this paper will demonstrate. The use of scatternets for image forgery detection has not been attempted until now; previously, there were only a couple of papers that used the DTCWT [
26] for deepfake detection or the copy–move approach [
30]. The authors in [
30] use the dual tree only for feature extraction and consider only the low-level subband, thus ignoring valuable information from the other subbands.
Our proposed encoder architecture presents several key differences from the approach outlined in [
28]—see
Figure 3. Firstly, the low-level band information is excluded, as our network is specifically designed to identify inconsistencies within higher components of the wavelet subband decomposition. This focus allows the model to concentrate on the high-frequency components that are more likely to reveal subtle manipulations or discrepancies. For each individual channel, given that
n levels of decomposition yield a total of 6 ×
n channels per input channel, an EfficientNetV2-like encoder architecture in [
31] is employed. The choice of EfficientNetV2 is driven by its efficient scaling and superior performance in extracting meaningful features while maintaining computational efficiency. This architecture’s ability to balance depth, width, and resolution ensures that the encoded features are both rich and computationally feasible. The encoder’s design is crucial because it transforms the input into a hierarchy of feature maps, capturing fine-to-coarse details from each channel. This is critical for detecting subtle manipulations like inpainting, where local inconsistencies may vary across different orientations. The multi-encoder setup allows the model to effectively capture a broad range of structural and textural information, as each encoder specializes in a different aspect of the wavelet coefficients. This diversity of features is essential for identifying alterations that standard CNN architectures may overlook.
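The construction of the encoder input from the high-frequency DTCWT subbands can be sketched as follows, assuming the pytorch_wavelets and timm packages and an EfficientNetV2-S backbone as a stand-in for the encoder of [31]. For brevity, a single feature extractor is applied to the stacked magnitude channels, whereas the proposed model uses one encoder per channel; the wavelet filter choices and level count are illustrative assumptions.

```python
import torch
import timm
from pytorch_wavelets import DTCWTForward

x = torch.randn(1, 3, 256, 256)                  # RGB input image (batch of 1)

# Two-level DTCWT; the low-pass band yl is discarded, only the 6 oriented
# high-frequency subbands per level are kept (6 * n channels per input channel).
xfm = DTCWTForward(J=2, biort='near_sym_b', qshift='qshift_b')
yl, yh = xfm(x)                                  # yh[j]: (N, C, 6, Hj, Wj, 2)

# Magnitude of the complex coefficients at the finest level, flattened to channels.
mag = torch.sqrt(yh[0][..., 0] ** 2 + yh[0][..., 1] ** 2)   # (N, C, 6, H1, W1)
enc_in = mag.flatten(1, 2)                       # (N, C*6, H1, W1) = 18 channels here

# An EfficientNetV2-like encoder; features_only exposes the multi-scale
# feature maps later consumed by the UNet++-style decoder.
encoder = timm.create_model('tf_efficientnetv2_s', pretrained=False,
                            features_only=True, in_chans=enc_in.shape[1])
features = encoder(enc_in)
print([f.shape for f in features])
```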
Subsequently, the outputs from each encoded channel are integrated into a decoder, structured similarly to a UNet++ architecture [
32]. The decoding process within the model leverages the U-Net++ architecture, which employs densely connected skip pathways that connect intermediate encoder outputs to corresponding stages within the decoder. This design significantly enhances the flow of information and facilitates feature reuse across different levels of the network, thereby improving the model’s ability to capture both local and global contexts. The U-Net++ decoder comprises nested decoding blocks, beginning with the deepest feature maps, characterized by low spatial resolution and high semantic content. Each block in the decoder progressively up-samples the feature maps while integrating information from the corresponding encoder stages via skip connections. This integration of multi-scale features allows the network to refine its segmentation outputs, enabling it to accurately delineate inpainted regions. The nested structure of the U-Net++ decoder provides a robust mechanism for capturing fine details and contextual information simultaneously, which is particularly crucial for distinguishing natural textures from artificially inpainted areas. By combining dense connectivity with multi-scale feature extraction, the decoder effectively reconstructs a detailed understanding of the image, facilitating the detection of subtle alterations. This architectural approach ensures that both high-level semantic features and low-level spatial features contribute to the final segmentation output, enhancing the model’s sensitivity to inconsistencies indicative of inpainting.
The final stage of the model involves a segmentation head that processes the outputs from the U-Net++ decoder. This component consists of a convolutional layer followed by a sigmoid activation function, which transforms the multi-channel decoder output into a single-channel segmentation mask. The sigmoid activation constrains the pixel values to a range between 0 and 1, representing the probability of each pixel being part of an inpainted region. This probabilistic approach to segmentation allows the model to express varying degrees of confidence regarding the likelihood of inpainting at each pixel location. The generated segmentation mask serves as a direct visual representation of the model’s predictions, highlighting areas that are deemed to be altered. This output is crucial for inpainting detection, as it provides a pixel-level identification of manipulated regions, allowing for a detailed and interpretable assessment of the image’s authenticity. The ability to produce precise segmentation masks is instrumental in forensic applications, where identifying even subtle inpainting artifacts can be critical.
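A minimal PyTorch sketch of such a segmentation head; the input channel count is an illustrative assumption rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Maps the multi-channel decoder output to a per-pixel inpainting probability."""
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, decoder_features: torch.Tensor) -> torch.Tensor:
        logits = self.conv(decoder_features)
        return torch.sigmoid(logits)           # values in [0, 1]: probability of inpainting

head = SegmentationHead(in_channels=32)
mask = head(torch.randn(1, 32, 256, 256))      # (1, 1, 256, 256) probabilistic mask
```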
The model’s output is a probabilistic segmentation mask that identifies areas within the image suspected of being inpainted. Each pixel value in the mask represents the model’s confidence in whether that specific pixel has been artificially altered. By integrating multi-scale feature extraction with dense connections throughout the decoding process, the model can detect nuanced artifacts commonly introduced by inpainting techniques, such as unnatural transitions, blurred edges, or repetitive patterns that deviate from the original texture of the image. Incorporating DTCWT coefficients enhances the model’s ability to detect structural anomalies, providing it with a unique sensitivity to frequency-domain features often indicative of inpainting. This approach allows the model to effectively distinguish between naturally occurring textures and those that have been artificially manipulated, offering a sophisticated and reliable tool for detecting image alterations. By leveraging advanced convolutional networks combined with wavelet-based feature extraction, the model demonstrates a high degree of accuracy in identifying inpainted regions, underscoring its potential applicability in forensic and image authentication contexts.
3.2. Adaptive Noise-Aware Texture Inconsistencies
To improve the accuracy of the network architecture described above, an extra module is added. Our proposed method for inpainting detection assumes that when inpainting is applied to an image, the removed object is typically enclosed within a larger area of similar texture (see
Figure 4). This assumption remains valid even when the object is located at the boundary between two distinct textured regions, as these regions can be analyzed independently (see
Figure 5). The method also relies on the observation that inpainted regions do not possess the same high-level features as the surrounding areas with similar textures.
The proposed module, called adaptive noise-aware texture inconsistencies, relies on Hierarchical Feature Selection [
33] and the DTCWT combined with noise-level estimation. Our paper introduces an advanced methodology for detecting inconsistencies in segmented color regions of an image. The process begins with Hierarchical Feature Selection for color segmentation, ensuring precise identification of regions based on color and texture characteristics, as can be noticed in
Figure 6. For each segmented color region, the module extracts the Dual-Tree Complex Wavelet Transform (DTCWT) coefficients, focusing on a one-level decomposition that results in six complex bands. These coefficients are specifically extracted from the positions indicated by the segmented color regions (see
Figure 6; the first row shows the original image and the mask of the object to be removed, and the second row shows the inpainted image and the texture segmentation obtained for the inpainted image).
Following the segmentation, the module divides each region into smaller patches, where noise estimation is performed. The noise estimation technique is based on the method outlined in [
34], which effectively estimates noise levels by analyzing image gradients, demonstrating both accuracy and efficiency across diverse image types. This method is particularly beneficial for tasks necessitating precise noise quantification. In our approach, we extend this concept by applying noise estimation to individual patches within color-segmented regions, rather than the entire image. This localized analysis allows us to detect discrepancies in noise levels within each segment. Specifically, by calculating noise levels for each patch and identifying regions with high noise variance, we flag these as suspect areas (see the output of applying the DTCWT to the inpainted image in
Figure 7). In
Figure 8, the first image represents the output of texture segmentation, the second image represents the first segment/area to be analyzed (our proposed method analyzes all textures above a given size threshold), and the last image represents the patches together with their noise estimation from the DTCWT. This targeted method enables more precise detection of noise inconsistencies, often indicative of image forgery or tampering. A more zoomed-in version of the results can be seen in
Figure 9.
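The exact gradient-based estimator of [34] is not reproduced here; as an illustrative stand-in, the sketch below uses Immerkær's fast Laplacian-based noise estimator applied per patch, which follows the same idea of high-pass filtering for noise quantification. The grid size and z-score threshold are assumed values for demonstration.

```python
import numpy as np
from scipy.ndimage import convolve

# Immerkaer-style mask used as an illustrative stand-in for the estimator of [34].
LAPLACIAN_MASK = np.array([[1, -2, 1],
                           [-2, 4, -2],
                           [1, -2, 1]], dtype=np.float64)

def estimate_noise_sigma(patch: np.ndarray) -> float:
    """Estimate the noise standard deviation of a grayscale patch."""
    h, w = patch.shape
    response = convolve(patch.astype(np.float64), LAPLACIAN_MASK)
    return np.sqrt(np.pi / 2.0) * np.abs(response).sum() / (6.0 * (w - 2) * (h - 2))

def flag_noisy_patches(region: np.ndarray, grid: int = 8, z_thresh: float = 2.0):
    """Split a grayscale region crop into grid x grid patches and flag noise outliers."""
    h, w = region.shape
    ph, pw = h // grid, w // grid
    sigmas = np.array([[estimate_noise_sigma(region[i*ph:(i+1)*ph, j*pw:(j+1)*pw])
                        for j in range(grid)] for i in range(grid)])
    z = (sigmas - sigmas.mean()) / (sigmas.std() + 1e-8)
    return np.abs(z) > z_thresh                 # boolean map of suspicious patches
```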
The last step in the module’s process involves comparing the suspected regions with a potential mask generated from a neural network. If the areas of high noise variance overlap with the mask, these regions are conclusively marked as forged, ensuring a robust detection of tampered areas. Conversely, regions without noise inconsistency are disregarded, thereby refining the final mask to only areas with confirmed inconsistencies. This method enhances the reliability of color segmentation and noise detection in complex image analysis tasks.
By combining Hierarchical Feature Selection (HFS) for initial color segmentation with the Dual-Tree Complex Wavelet Transform (DTCWT) and noise estimation per region, the approach captures inconsistencies accurately. The reasons why this design is effective are the following:
Texture/color segmentation: Precise segmentation is essential because inconsistencies and noise patterns are better detected when analysis is confined to homogenous regions rather than applied to the whole image. The focus on texture and color provides a stable basis for noise estimation in each region, leading to a reliable analysis of inconsistencies.
Complex wavelets decomposition: Because DTCWT provides both real and imaginary components, it can better represent subtle textures and structural details in each region. Analyzing these coefficients allows the model to capture local inconsistencies that may not be evident through raw pixel values. This is especially relevant in detecting small-scale texture discrepancies that signify inconsistencies.
Noise estimation per region in wavelet domain: Noise patterns vary across regions, especially in images with diverse textures and colors. Estimating noise locally for each patch (rather than globally) accommodates these variations, making it easier to detect unnatural inconsistencies in textures and patterns. By analyzing wavelet coefficients rather than raw pixels, the noise estimation becomes sensitive to structural anomalies that could indicate potential inconsistencies.
To summarize, the core idea of the proposed method is that, within a given texture, wavelet coefficients are expected to follow a certain statistical distribution, often close to a normal distribution, particularly in high-level, noise-free regions. Deviations from this expected distribution suggest potential anomalies or inconsistencies. This expectation aligns with the general assumption that, in a coherent texture, the response of the wavelet coefficients should remain relatively consistent (following a near-normal distribution due to the central limit theorem in the statistical analysis of natural images). When the coefficients deviate significantly from this expected distribution, it flags possible inconsistencies—suggesting either noise or texture anomalies that deviate from the original structure. This approach ensures that inconsistencies are not merely variations in pixel color but statistically significant deviations within the texture’s structure. While the method appears robust, there could be challenges in handling images that have regions with overlapping textures or very subtle inconsistencies that do not significantly affect the wavelet coefficient distributions.
4. Results
4.1. Real Inpainting Detection Dataset
Building on our previous research in [
3], it is evident that existing inpainting detection datasets are insufficient for real-world applications. These datasets often lack critical attributes necessary for accurately identifying inpainted regions (e.g., removed objects). Even in recent works like [
18,
19], the proposed datasets suffer from inconsistencies where the inpainted regions do not align well with the original image content. To improve the effectiveness of the inpainting methods, we propose focusing on clearly defined objects or regions. For this reason, we employ the Google Open Images V7 dataset [
35], and for the segmented objects we rely on [
36].
To evaluate the effectiveness of our proposed method, we applied several inpainting techniques. Specifically, for each image in the Google Open Images dataset, we selected a segmented object and applied three different inpainting methods. This process resulted in a total of 12,000 forged images from an initial input of 4000 images. All original images, masks, and inpainted images are uploaded to
https://github.com/jmaba/Deep-dual-tree-complex-neural-network-for-image-inpainting-detection (accessed on 10 November 2024). The inpainting methods are drawn from different approaches, such as the Fourier-based method in [
37] or newer methods based on transformers like [
38,
39].
4.2. Experimental Setup
During the training process, 6000 of the images were randomly selected from each inpainting method. A separate set of 1500 images was used for validation, while the rest were used for testing. To ensure a robust evaluation of the model, we implemented a k-fold cross-validation strategy, allowing us to assess the model’s performance across various data subsets. The model was trained for 30 epochs, providing enough iterations for optimization convergence while minimizing the risk of overfitting. This method allowed for a thorough assessment of the model’s ability to generalize to unseen data. The AdamW optimization algorithm was employed with a learning rate of 1 × 10^-5, while the weight decay was set to 1 × 10^-4.
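For reproducibility, the reported optimizer settings correspond to the following PyTorch configuration. The model, data loader, and loss function shown here are placeholders and assumptions (a binary cross-entropy objective is assumed for the mask output), not the exact training code of the paper.

```python
import torch
from torch.optim import AdamW

# Placeholders for the scatternet/UNet++ model and the Real Inpainting
# Detection Dataset loader described in this paper.
model = torch.nn.Conv2d(3, 1, kernel_size=1)           # placeholder network (outputs logits)
train_loader = []                                       # placeholder iterable of (image, mask)

optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
criterion = torch.nn.BCEWithLogitsLoss()                # assumption: binary mask loss

for epoch in range(30):                                 # 30 epochs, as reported
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```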
4.3. Implementation Details
4.3.1. Enhanced Wavelet Scattering Network Module
For the wavelet scattering network, two methods have been employed: the first option is the standard Morlet wavelet scattering proposed by Mallat and improved in [
40], and the second option is the Dual-Tree complex wavelet scattering in [
28]. For the first option, the transform used in this study applies two successive wavelet transforms, each followed by a modulus non-linearity, utilizing eight different angles for the wavelet transform. For the latter, we use a second-order wavelet scattering network. To determine which wavelet scattering yields the best results, the same training and validation procedure was run for 20 epochs. From the table below, it can be observed that the Morlet-based approach indeed gives better results, but not by a significant margin compared to the Cotter method, and in terms of training/testing time the Cotter approach is ten times faster. Based on these results, the scattering method used is the one proposed by Cotter. For the EfficientNetV2 model architecture, the EfficientNetV2-S encoder is applied to each wavelet scattering channel band. The decoder utilizes an architecture similar to U-Net++ for each individual band, which is then followed by a segmentation head. A unique feature of the proposed model is the incorporation of global average pooling to extract global contextual information from the decoder output. This global context is processed through a fully connected layer to produce a feature that summarizes the entire image’s content. The global feature is then merged with the high-resolution output from the decoder via a custom convolutional layer. This fusion of local and global information enhances the model’s ability to detect inpainting across different scales.
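A minimal sketch of the global-context fusion described above; the channel sizes and layer shapes are illustrative assumptions, not the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    """Fuse a globally pooled summary of the decoder output back into the
    high-resolution feature map before the segmentation head (sketch)."""
    def __init__(self, channels: int = 32, global_dim: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Linear(channels, global_dim)      # global image descriptor
        self.fuse = nn.Conv2d(channels + global_dim, channels, kernel_size=3, padding=1)

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        n, c, h, w = decoder_out.shape
        g = self.fc(self.pool(decoder_out).flatten(1))          # (N, global_dim)
        g = g[:, :, None, None].expand(n, -1, h, w)             # broadcast over space
        return self.fuse(torch.cat([decoder_out, g], dim=1))    # local + global fusion

fused = GlobalContextFusion()(torch.randn(1, 32, 128, 128))
```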
4.3.2. Adaptive Noise-Aware Texture Module
The segmentation result from the network is passed into a noise-aware texture module. Initially, color segmentation is performed on the input image using the method from [
33] with the SLIC parameter set to 32. Segmented regions smaller than 10% of the original image size are discarded. The segmented image is then divided into patches on an eight-by-eight grid. For the noise variance analysis, the original six subbands (magnitude values) are used. If the standard deviation between patches exceeds a certain threshold, those patches are flagged as suspicious. Only areas where suspicious regions overlap with the segmentation results from the previous network module are marked as forged.
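A rough sketch of the DTCWT-domain variance analysis with the parameters listed above (eight-by-eight grid, six magnitude subbands, overlap with the network mask). The HFS/SLIC segmentation step is assumed to be performed elsewhere, the deviation factor k is an illustrative assumption, and the open-source dtcwt package stands in for the paper's implementation.

```python
import numpy as np
import dtcwt

def flag_inconsistent_patches(image_gray: np.ndarray, grid: int = 8, k: float = 1.5):
    """Per-patch spread of the six DTCWT magnitude subbands; patches whose spread
    deviates strongly from the image-wide statistics are flagged as suspicious."""
    pyramid = dtcwt.Transform2d().forward(image_gray.astype(np.float64), nlevels=1)
    mags = np.abs(pyramid.highpasses[0])                  # (H/2, W/2, 6) magnitudes
    h, w, _ = mags.shape
    ph, pw = h // grid, w // grid
    spread = np.array([[mags[i*ph:(i+1)*ph, j*pw:(j+1)*pw, :].std()
                        for j in range(grid)] for i in range(grid)])
    return np.abs(spread - spread.mean()) > k * spread.std()   # boolean grid x grid map

def fuse_with_network_mask(flags: np.ndarray, nn_mask: np.ndarray) -> np.ndarray:
    """Keep only suspicious patches that overlap the mask proposed by the network."""
    grid = flags.shape[0]
    h, w = nn_mask.shape
    ph, pw = h // grid, w // grid
    out = np.zeros_like(nn_mask, dtype=bool)
    for i in range(grid):
        for j in range(grid):
            if flags[i, j] and nn_mask[i*ph:(i+1)*ph, j*pw:(j+1)*pw].any():
                out[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = True
    return out
```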
4.4. State-of-the-Art Analysis Comparison
As in previous works, we evaluate performance using IoU, F1, Precision, Recall, and Accuracy at the pixel level. Results were reported using the default threshold of 0.5. For image-level analysis, we focus on balanced accuracy, which considers both false positives and false negatives, with the threshold set to 0.5. The proposed method was evaluated using an extended version of the publicly available inpainting forgery detection dataset. The results demonstrate the effectiveness of the proposed method in detecting tampered regions across a wider variety of image content and inpainting techniques. For demonstration purposes, we selected several images, including the originals, masks, and inpainting outcomes. We applied the detection methods proposed in [
18], referred to as IID; Ref. [
41], referred to as PSCCNET; Ref. [
42], referred to as FOCAL; and [
43], referred to as TruFor. For all models, the available pretrained networks were used. For [
19] (a variant of [
18]), we tried training it on our dataset, but it did not yield reliable results during training. We hypothesize that this issue arises because the network requires input images of a fixed size (256 × 256). Consequently, the resizing operation at the beginning of the training process may distort critical features in the images, preventing the network from effectively learning. In
Figure 10, the original image, the mask of the removed object, and the inpainted image are presented. The inpainted image is the input to the detection methods. In
Figure 11, the results of inpainting detection are presented, comparing our method with state-of-the-art methods.
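For clarity, the pixel-level metrics at the 0.5 threshold used throughout this section can be computed as follows; this is a generic sketch, independent of any particular detector.

```python
import numpy as np

def pixel_metrics(pred_prob: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5):
    """IoU, F1, precision, recall and accuracy between a predicted probability map
    and a binary ground-truth mask, evaluated at the given threshold."""
    pred = pred_prob >= threshold
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return {
        "IoU": tp / (tp + fp + fn + 1e-8),
        "F1": 2 * precision * recall / (precision + recall + 1e-8),
        "Precision": precision,
        "Recall": recall,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```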
In
Figure 12, the metrics for each detection method on each dataset are presented. The evaluation of various methods (FOCAL, IID, PSCCNET, TruFor) against our proposed method in the task of image inpainting detection highlights the superiority of our approach across key metrics: Accuracy, IoU (Intersection over Union), and Precision. Our proposed method consistently demonstrates the highest performance across all metrics, indicating its robustness in accurately identifying inpainted regions. Compared to other methods, it shows a marked improvement in Accuracy, reflecting superior reliability in detection. The significantly higher IoU values highlight its exceptional ability to precisely localize inpainted areas, which is crucial for detecting subtle modifications in images. Additionally, the outstanding Precision of our method—far exceeding that of FOCAL, IID, and PSCCNET—underscores its effectiveness in minimizing false positives, a critical factor for practical application. Overall, our proposed method outperforms existing approaches, establishing itself as a leading solution for image inpainting detection. Its consistent superiority across Accuracy, IoU, and Precision metrics suggests that it provides more reliable, precise, and confident detection of inpainting artifacts. These results position our method as highly effective for applications in image forensics, content verification, and quality assessment, reinforcing its potential as a state-of-the-art tool in the field.
4.5. Ablation Study
To evaluate the effectiveness of the proposed method, we focused on training only on images from one dataset and testing on the others. As shown in
Table 1, the IoU was suboptimal, indicating that training exclusively on the LaMa dataset does not yield satisfactory results. This finding suggests that the network may need to be specifically trained for each category of image inpainting method. Additionally, it highlights that even in the absence of visual cues in the Dual-Tree Complex Wavelet coefficients, the network still learns relevant information.
The IoU is significantly higher when the model is trained on all datasets (0.82) compared to when it is trained on LaMa alone and tested on other datasets (0.20 for MAT and 0.12 for ZITS). This stark reduction highlights the importance of diverse training data in learning a more generalized feature set that can effectively capture inpainted regions across different test sets. Precision drops sharply when the model is trained on LaMa and tested on other datasets, with values of 0.22 for MAT and 0.16 for ZITS, compared to 0.93 when trained on all datasets. This indicates that the method struggles with false positives when trained on limited data, underscoring the poor transferability of features learned solely from the LaMa dataset. Accuracy also drops sharply when the model is trained on LaMa and tested on other datasets, with values of 0.42 for MAT and 0.41 for ZITS, compared to 0.95 when trained on all datasets. This further confirms the poor transferability of the features learned solely from the LaMa dataset. The significant decline in performance metrics when the model is trained only on the LaMa dataset and tested on MAT or ZITS highlights the model’s overfitting to the specific characteristics of the LaMa dataset. The lack of diverse training data restricts the model’s ability to generalize, making it vulnerable when exposed to unseen test data with different textures, artifacts, or inpainting patterns. The results emphasize the critical need for diverse and comprehensive training data to ensure the proposed method’s robustness and effectiveness across various test conditions. Training on a wider range of datasets enables the model to capture a broader spectrum of inpainting features, resulting in significantly improved IoU, Precision, and Accuracy. These findings suggest that future work should prioritize the inclusion of varied inpainting patterns and artifacts in the training phase to enhance the model’s generalizability and performance across different application scenarios.
4.6. Post-Processing Impact of Forgery Detection
Building on the results presented above, the following subsection delves deeper into the capabilities of the proposed method in detecting alterations. We focused on two common operations—image resizing and blurring. Given that the input dataset comprises images of varying sizes, we evaluated detection performance using proportional resizing. Additionally, we evaluated the impact of blurring the input image separately, and finally, we combined resizing and blurring into a single operation. The first post-processing operation analyzed is resizing (see
Figure 13 and
Figure 14). The images are resized to 0.7 of their initial size. As can be noticed, the proposed method does not yield good results (see
Table 2).
The IoU drops dramatically from 0.82 to 0.13 when the images are resized, indicating that the resized images severely impair the method’s ability to correctly overlap detected inpainted regions with the true regions. This significant reduction suggests that the method struggles to maintain consistent detection performance when subjected to size alterations. Precision decreases from 0.93 to 0.76 after resizing, showing that the method becomes less reliable and introduces more false positives when identifying inpainted areas. Although precision remains relatively high, the drop indicates that resizing introduces noise or artifacts that affect detection quality. Accuracy also suffers, decreasing from 0.95 to 0.78. This decline reflects the method’s overall reduced effectiveness in distinguishing inpainted from non-inpainted regions under resizing conditions. The primary cause of this degradation is the sensitivity of the Dual-Tree Complex Wavelet Transform (DTCWT) coefficients to resizing operations. DTCWT coefficients are critical features used by the proposed method for detecting inpainted regions. However, resizing alters the spatial frequency and orientation of these coefficients, leading to misalignments and incorrect feature extraction, ultimately impairing the detection capability. The results demonstrate that the proposed method’s performance is highly susceptible to resizing operations, particularly due to the disruption of DTCWT coefficients. This highlights a significant limitation when applying the method to images that undergo resizing, emphasizing the need for robust adaptation or alternative feature extraction strategies to handle such transformations effectively. Future work should focus on enhancing the resilience of the detection algorithm to maintain performance across various post-processing operations, including resizing.
The second investigated post-processing operation is blurring. For this operation, we took the original images and applied a box blur with a radius of 5. The visual results can be seen in
Figure 15, while the overall results can be seen in
Table 3.
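The two post-processing operations evaluated here (proportional resizing to 70% of the original size and box blurring with a radius of 5) can be reproduced with Pillow; the synthetic input image below is a placeholder for an inpainted test image.

```python
from PIL import Image, ImageFilter

# Placeholder image; in practice, an inpainted test image would be loaded here.
img = Image.new("RGB", (640, 480), color=(128, 128, 128))

# Proportional resize to 70% of the original dimensions.
resized = img.resize((int(img.width * 0.7), int(img.height * 0.7)), Image.BICUBIC)

# Box blur with a radius of 5 pixels.
blurred = img.filter(ImageFilter.BoxBlur(5))

# Combined operation: resize followed by blur (one possible combination).
combined = resized.filter(ImageFilter.BoxBlur(5))
```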
The IoU decreases from 0.82 to 0.33 when blurring is applied, indicating a moderate reduction in the method’s ability to correctly identify the overlapping areas between the detected and actual inpainted regions. Although the IoU is lower for blurred images, the method retains some effectiveness in localizing inpainted regions despite the added noise from blurring. Precision remains relatively stable, with a slight decrease from 0.93 to 0.92. This suggests that blurring does not significantly impact the method’s ability to maintain a low rate of false positives. The near-constant precision indicates that the method is still confident in its positive detections, even when blurring is applied. The accuracy drops from 0.95 to 0.85, showing that blurring reduces the method’s overall effectiveness in correctly distinguishing inpainted regions from non-inpainted areas. The decreased accuracy points to challenges in correctly classifying the altered pixel values introduced by the blurring effect. Blurring, particularly box blurring, smooths the image by averaging pixel values within a given radius, which disrupts the edge and texture information critical for inpainting detection. The proposed method relies on detailed feature extraction, and blurring diminishes the distinctiveness of inpainted regions, making them harder to detect accurately. While the proposed method’s performance degrades under blurring, the impact is less severe compared to resizing. The IoU and accuracy decline, indicating challenges in detecting precise inpainted boundaries and maintaining overall detection accuracy. However, the high precision suggests that the method remains effective at minimizing false positives despite blurring. These results highlight the method’s partial robustness to blurring but also underscore the need for enhancement strategies to mitigate the effects of such image transformations. Future research should explore adaptive filtering techniques or more resilient feature extraction methods that can better withstand blurring without compromising detection quality.
4.7. Analysis on Other Datasets
Based on the results obtained above, the next step was to assess the robustness of the method on other datasets. The first dataset to be analyzed is the one proposed in [
18]. The dataset consists of two parts—one used for training/validation (based on [
44,
45]) and one used for testing. The dataset consists only of small images (256 × 256). Also, the masks were generated randomly, thus forcing the inpainting methods to generate a lot of artifacts. Drawing on the ablation study, and to ascertain the robustness of the proposed method, the neural network had to be retrained. Thus, the same steps as those performed in [
18] were followed: training on the Places/Dresden subset and testing on the other subset of the dataset. The network was trained for 20 epochs. Additionally, due to the small size (and low quality) of the images, the adaptive noise-aware texture module was not applied. The overall results can be seen in
Table 4, while some test image results can be seen in
Figure 16 and
Figure 17. Although the method appears to learn and generalize the features in the training/validation phase, it is not able to detect inpainted areas on the testing dataset. Despite these general performance differences, there are certain test images where the model achieves over 90% across all metrics, suggesting that it can clearly distinguish some patterns effectively. This indicates that the model may be proficient at recognizing specific, distinct features in these cases, even though it struggles to generalize broadly across the dataset, or it needs more time (epochs) to learn all types of patterns.
The next analyzed dataset is the one proposed in [
46]. In this dataset, the authors took an approach similar to ours: they took all images from the COCO dataset [
47] along with the segmented masks for each object inside each image and then applied the inpainting method described in [
48]. While their approach aligns with ours in certain ways, it lacks the authenticity of selecting objects for removal based on the surrounding context of the area, as can be noticed in
Figure 18. Since the inpainting method used here differs from those in our own approach, retraining was necessary. The model shows encouraging progress over the first 20 training epochs, with consistent improvement in key metrics like IoU, precision, and F-score on the training set – as can be observed in
Table 5. By the end of this phase, the training IoU nearly doubles, and precision steadily improves, indicating the model’s growing ability to accurately capture target regions. Recall remains high from the start, suggesting that the model is effective at identifying relevant features across images. In the validation phase, however, results are more variable. While there is some improvement over time, validation IoU and precision fluctuate, hinting at challenges in generalizing to new data. This inconsistency, compared to the training performance, suggests possible overfitting, where the model does well on known data but struggles to adapt to unseen images. In summary, the model shows promising learning on the training set but could benefit from targeted adjustments to improve reliability across diverse datasets, supporting a more balanced and adaptable performance. Some results can be noticed in
Figure 18 and
Figure 19.