1. Introduction
Image inpainting, the process of reconstructing lost or deteriorated parts of an image, is essential in fields such as image editing and restoration. Although inpainting methods have advanced significantly, as can be observed in the thorough reviews in [
1] and [
2], detecting the presence of inpainting remains a critical challenge [
3]. The increasing sophistication of these techniques necessitates reliable tools that can identify even subtle modifications. The proliferation of image manipulation has spurred significant growth in multimedia forensics and related disciplines. Consequently, a plethora of methods and tools have emerged for detecting and locating image forgeries. As can be seen from systematic reviews of forgery detection papers [
4], while this field has expanded rapidly, current research primarily concentrates on the identification of deepfakes [
5]. Although substantial advances have been achieved in detecting image copy-move or splicing, current state-of-the-art methods for inpainting detection remain insufficient for practical use in real-world situations. This deficiency is due to several critical challenges that are still being tackled:
Limited generalizability across various image datasets
Limited realism of the forged samples
The absence of a comprehensive image dataset that encompasses multiple inpainting techniques
Limited coverage of image variations
Insufficient detection accuracy
The fast-paced evolution of inpainting technologies
To address the above limitations, this paper proposes a novel inpainting detection framework that utilizes a wavelet scatternet based on the Dual-Tree Complex Wavelet Transform in conjunction with convolutional neural networks (CNNs). We propose a method that combines wavelet scattering with a UNet++ architecture and an EfficientNetV2-like encoder. Additionally, to enhance the Intersection over Union (IoU) metric, the proposed method incorporates a fusion module at the end of the neural network (see
Figure 1), which is based on texture segmentation and noise level analysis applied to the mask suggested by the neural network. Additionally, we propose a new dataset called the Real Inpainting Detection Dataset, designed to address the existing limitations in current datasets. In this dataset, the inpainted objects (removed objects) are based on real object masks sourced from the Google Open Images dataset, rather than randomly generated masks that could potentially overlap with multiple objects in an image. The dataset incorporates several distinct inpainting methods, each applied to the same images, and includes a variety of backgrounds and architectural approaches (such as GANs, diffusion models, and transformers) to facilitate a more accurate validation of the detection method.
2. Related Work
Image forensics is a discipline dedicated to detecting and analyzing digital image manipulation. It involves techniques and algorithms to determine images' authenticity, origin, and integrity. The primary goal is to expose forgeries, identify image sources, and provide evidence in legal investigations. Based on the classification proposed by the authors of [
6,
7], image forgery detection can be classified according to the traces it exploits: either operation-specific traces (copy-move, splicing, inpainting) or optical camera traces (blur, noise, chromatic aberration, etc.). From a detection standpoint, inpainting forgery detection has not been extensively examined. It has traditionally been regarded as a subset of copy-move forgery detection. Copy-move forgery detection (CMFD) is a well-established digital image forensic technique known for its capability to detect altered regions in multimedia content. Researchers have developed numerous CMFD algorithms, drawing either on traditional digital image processing (DIP) and feature-based approaches or on deep learning techniques. In the following subsection, we review CMFD methods applicable to image inpainting, as well as a few inpainting-specific methods.
In block-based methods, the image is divided into non-overlapping, overlapping, or partially overlapping blocks of equal size. Feature extraction is then performed on each block, and lastly the features are sorted lexicographically. For feature extraction, several methods have been employed, such as the DCT [
8] or DWT [
9]. The method based on the DWT applies image decomposition and then further processes only the LL subband, thus ignoring high-frequency details. Some authors have even tried to combine these techniques [
10]. All the above methods have their strengths, but their main disadvantage lies either in the inability to handle cases such as resizing and rotation of the copied object or in block-related design decisions, such as block size, overlapping versus non-overlapping partitioning, dimensionality reduction, etc. To overcome these limitations, authors have suggested approaches based on key-point extraction, such as SURF [
11] or SIFT [
12]. From an analysis performed in [
13] and a well-crafted dataset, the method with the best results is based on Zernike moments [
14]. In general, so-called key-point methods perform better than block-based ones, but they cannot handle cases in which real objects with the same texture naturally repeat within the image [
15]. For older inpainting patch-based methods like the one proposed by Criminisi [
16], some of the above methods yield decent results, but they are unable to handle inpainting produced by newer patch-based and deep learning methods.
In terms of newer, neural networks-based forgery detection methods, one of the most cited papers is [
17]. The authors’ approach is a universal forgery detection mechanism, which combines a feature extractor with an anomaly detection module. The feature extractor learns several types of patterns introduced by forgery processing operations (such as blur inconsistencies, color/texture inconsistencies, etc.). The problem with their approach is that newer methods, such as diffusion models or transformers, introduce markedly different artifacts; it is therefore always necessary to retrain the feature extractor on newer methods. Specifically, for inpainting detection, in [
18], the authors proposed a neural network called IID-Net, which is similar to ManTra-Net. IID-Net employs Neural Architecture Search (NAS) to automatically discover the most effective network architecture for inpainting detection. This approach allows IID-Net to find an optimal configuration tailored to the specific task of inpainting detection. Additionally, IID-Net incorporates an attention mechanism that helps the network focus on the regions most likely to be manipulated, enhancing its ability to detect subtle inpainting forgeries. To train on diverse types of inpainting, the authors of IID-Net also propose a new dataset built with several inpainting methods. The problem with their dataset is that the mask to be removed (filled in) by the inpainting methods is created artificially and arbitrarily, forcing the inpainting methods to generate many visible artifacts and therefore improving the chances of the detection method. An improvement to the IID method was proposed in [
19], called AFTLNet. The network uses an adaptive learning framework to efficiently learn and detect the traces left by inpainting operations, which are often subtle and difficult to identify. AFTLNet’s performance is highly dependent on the quality and diversity of the training data. Their proposed dataset uses small images of 256 × 256, and again, there is a problem with the generated mask of the object to be inpainted. In recent years, several improvements have been made by looking at the problem differently and relying only on a specific type of artifact—noise [
20]. Noisesniffer operates by analyzing the noise variance across different regions of an image. It assumes that authentic images have a consistent noise pattern, while manipulated regions will exhibit anomalous noise characteristics. The tool extracts noise features from the image and uses them to create a noise map, which highlights potential tampered areas. Noisesniffer relies heavily on the assumption that authentic images have consistent noise patterns. However, this assumption may not hold for all images, especially those captured under varying conditions or with different devices, which can introduce natural noise inconsistencies. The method can be less effective on images that have undergone heavy compression. Compression artifacts can interfere with the noise extraction process, leading to false positives or false negatives in forgery detection [
21]. Recently, Ref. [
22] aims to enhance deepfake detection in low-quality, highly compressed content by leveraging a three-branch architecture: a Local High-Frequency Enhancement (LHiFE) branch, a Global High-Frequency Enhancement (GHiFE) branch, and a regular branch. The regular branch uses the standard RGB image as input, the LHiFE branch uses a Discrete Cosine Transform to enhance local high-frequency information, and the GHiFE branch employs a multi-level wavelet decomposition to extract global high-frequency information. Additionally, a Two-Stage Cross-Fusion module is designed to integrate information from these branches effectively, enhancing weak high-frequency information in low-quality data. Although the method exhibits good results on the analyzed datasets, it is significantly affected by the cross-dataset generalization problem and by the extension to object removal (inpainting). The limitation with respect to inpainting stems from the nature of the inpainting process, which fills regions using information already present in the image, so that the color and texture of the filled area exhibit the same correlations as the rest of the image. Other limitations include vulnerability to adversarial attacks and the fact that the method does not exploit temporal inconsistencies; rather, it focuses only on spatial information. Lastly, the method is applied to rather small images (256 × 256), which impacts object removal for real-world images: in smaller images, edges and transitions may appear less sharp due to limited pixel information, producing blur artifacts, texture loss, and a loss of high-frequency information. Like the previous study, the work in [
23] proposes a similar approach. However, the key differences lie in its use of a combined FFT and Wavelet method for feature extraction and the integration of a 3D CNN with LSTM to capture spatial and temporal consistencies in deepfake detection.
To address all the above-mentioned limitations, this paper proposes a novel approach that incorporates scatternets, first proposed by Mallat in [
24] and the Dual-tree complex wavelet first proposed by Kingsbury in [
25]. Wavelet transforms have been widely used in image-processing tasks such as compression, denoising, and feature extraction. Wavelet scattering, proposed by Mallat, provides a robust representation of signals and images that is stable to deformations, such as translations and small rotations, making it particularly useful for tasks like classification and pattern recognition. The Dual-Tree Complex Wavelet Transform (DT-CWT), introduced by Kingsbury, offers advantages over the traditional Discrete Wavelet Transform (DWT) by providing nearly shift-invariant and directionally selective filters. These properties make the DT-CWT particularly useful in detecting fine details and preserving edge information, which are crucial in inpainting detection. To date, only two works have applied these transforms to forgery detection: the scatternet was used in [26], in which the authors tried to use the resulting artifacts for face forgery detection, and the DT-CWT was used in [27], which follows the same block-based approach but relies on the DTCWT for feature extraction.
3. Proposed Method
Our proposed method uses a Scatternet encoder/decoder architecture to learn high-level feature inconsistencies and generate an initial mask. To enhance the detection process, an additional module is incorporated, which takes the original image and applies color segmentation. For each segmented region, the module determines the extent of overlap with the mask produced by the scatternet. Segments that either do not overlap or fully overlap with the mask are disregarded. However, if a segment partially overlaps with the mask, a noise distribution is calculated for that segment. Only irregularities in the noise distribution at the block level within these partially overlapping segments are marked as potential forgeries. A detailed procedure is presented in
Figure 2.
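To make the procedure of Figure 2 concrete, the following minimal Python sketch mirrors the fusion logic described above. The helpers scatternet_mask, hfs_segments, and patch_noise_levels are placeholders for the components detailed in Sections 3.1 and 3.2, and the deviation threshold is an illustrative assumption rather than a value taken from the paper.

```python
import numpy as np

def detect_inpainting(image, scatternet_mask, hfs_segments, patch_noise_levels,
                      noise_std_threshold=2.0):
    """Fuse the neural-network mask with texture/noise analysis (sketch).

    scatternet_mask    : callable, image -> binary mask proposed by the network
    hfs_segments       : callable, image -> list of boolean region masks (color/texture)
    patch_noise_levels : callable, (image, region) -> (patch masks, per-patch noise levels)
    """
    nn_mask = scatternet_mask(image)                 # initial mask (Section 3.1)
    final_mask = np.zeros_like(nn_mask, dtype=bool)

    for region in hfs_segments(image):               # color/texture segmentation
        overlap = np.logical_and(region, nn_mask).sum() / max(region.sum(), 1)
        if overlap == 0.0 or overlap == 1.0:
            continue                                  # no overlap or full overlap: ignore
        # Partially overlapping region: inspect its per-patch noise distribution.
        patches, noise = patch_noise_levels(image, region)
        z = (noise - noise.mean()) / (noise.std() + 1e-8)
        for patch_mask, score in zip(patches, z):
            if abs(score) > noise_std_threshold and np.logical_and(patch_mask, nn_mask).any():
                final_mask |= patch_mask              # confirmed inconsistency
    return final_mask
```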
3.1. Enhanced Wavelet Scattering Network
Our inspiration is drawn from recent studies in the use of complex wavelets in neural networks [
28]. The first novelty of the proposed method for image inpainting detection is the use of scatternets for feature extraction—more specifically, an enhanced variant of the learnable scattering layer proposed in [
28]. Wavelet Scatternet [
24], a powerful tool for feature extraction in image forgery detection, operates on the principles of wavelet transformation and scattering. The method involves cascading wavelet transformations and modulus non-linearities to capture local frequency content through wavelet coefficients, providing a comprehensive representation of image features. This approach is particularly effective in capturing information at various scales in an image, from global components to finer details and high-frequency components. The mathematical foundation of the wavelet scatternet involves the following main operations: the wavelet transform, a convolution with a set of wavelet filters at different scales and orientations; the modulus operation, which captures the amplitude of the signal, making the representation non-linear and thus more robust to small variations in the input; averaging (smoothing), which reduces the spatial resolution while maintaining essential information, enhancing the stability of the features; a non-linearity applied to enhance discriminative information; and, as a last step, pooling, in which features are pooled to reduce dimensionality and improve generalization. For an input image $x$, the wavelet transform at scale $j$ and orientation $\theta$ is given by
$$W_{j,\theta}\,x = x \ast \psi_{j,\theta},$$
where $\psi_{j,\theta}$ is the wavelet filter. The modulus and averaging operations are then applied to form the scatternet representation:
$$S\,x = \left| x \ast \psi_{j,\theta} \right| \ast \phi_J,$$
where $\phi_J$ is a low-pass filter. After applying the wavelet transform, the modulus operation is performed to introduce non-linearity:
$$U\,x = \left| x \ast \psi_{j,\theta} \right|.$$
This operation ensures that the magnitude of the wavelet coefficients is considered, which helps emphasize texture discontinuities and anomalies that may arise from forgery. Non-linearity plays a crucial role in capturing complex interactions between different frequency components. To achieve translation invariance, the modulus of the wavelet coefficients is smoothed by applying a low-pass filter, typically a Gaussian filter:
$$S_1\,x = \left| x \ast \psi_{j,\theta} \right| \ast \phi_J,$$
where $\phi_J$ is the low-pass filter. This step ensures that small translations in the image do not significantly alter the feature representation, providing robustness to misaligned forged regions. The averaging step provides translation invariance, ensuring that small shifts or displacements in forged regions do not result in substantial changes in the scatternet representation. Mathematically, for small translations $\tau$, we have (this robustness is crucial in cases where forged regions may be slightly misaligned):
$$S\,x(u - \tau) \approx S\,x(u).$$
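As a didactic illustration of the first-order coefficient $S_1\,x = |x \ast \psi_{j,\theta}| \ast \phi_J$, the snippet below computes one scattering channel using a Gabor (Morlet-like) filter from scikit-image and a Gaussian low-pass filter from SciPy. It is a sketch for intuition only, not the implementation used in the paper; the frequency, orientation, and smoothing scale are illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import data
from skimage.filters import gabor

image = data.camera().astype(np.float64) / 255.0   # any grayscale test image

# Wavelet transform: convolution with a complex (Gabor/Morlet-like) filter
# at one scale (frequency) and one orientation theta.
real, imag = gabor(image, frequency=0.25, theta=np.pi / 4)

# Modulus non-linearity: keep the magnitude of the complex response.
modulus = np.hypot(real, imag)

# Averaging with a low-pass (Gaussian) filter phi_J for local translation invariance.
s1 = gaussian_filter(modulus, sigma=4)

print(s1.shape, s1.mean())   # one first-order scattering channel
```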
A special type of wavelet scattering was proposed by the authors in [
29]. The Dual-Tree Complex Wavelet Transform (DTCWT) extends the traditional Discrete Wavelet Transform (DWT) by using complex-valued wavelets. It involves two parallel wavelet trees (real and imaginary) that form a complex representation. The complex wavelet coefficients are formed as
$$W_{j,\theta}\,x = W^{r}_{j,\theta}\,x + i\,W^{i}_{j,\theta}\,x,$$
where $W^{r}_{j,\theta}$ and $W^{i}_{j,\theta}$ denote the real and imaginary wavelet trees, respectively. The magnitude of these coefficients is used to capture texture features:
$$\left| W_{j,\theta}\,x \right| = \sqrt{\big(W^{r}_{j,\theta}\,x\big)^{2} + \big(W^{i}_{j,\theta}\,x\big)^{2}}.$$
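A minimal example of obtaining the six oriented complex subbands and their magnitudes, assuming the open-source dtcwt Python package; this is used here only for illustration and may differ from the implementation employed in the paper.

```python
import numpy as np
import dtcwt

image = np.random.rand(256, 256)                # placeholder grayscale image

transform = dtcwt.Transform2d()
pyramid = transform.forward(image, nlevels=1)   # one-level decomposition

# pyramid.highpasses[0] holds the six oriented complex subbands
# (orientations of roughly +/-15, +/-45 and +/-75 degrees), each at half resolution.
bands = pyramid.highpasses[0]                   # shape: (H/2, W/2, 6), complex valued
magnitudes = np.abs(bands)                      # texture features used for the analysis

print(bands.shape, magnitudes.mean(axis=(0, 1)))
```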
The main benefits of using the DTCWT for capturing texture patterns can be summarized as follows: shift invariance (it is approximately shift-invariant, making it robust for texture recognition), reduced redundancy (a more efficient representation compared to other complex wavelet transforms), and speed, as the results presented later in this paper will demonstrate. The use of scatternets for image forgery detection has not been attempted until now; previously, there were only a couple of papers that used the DTCWT [
26] for deepfake detection or the copy–move approach [
30]. The authors in [
30] use the dual tree only for feature extraction and consider only the low-level subband, thus ignoring valuable information from the other subbands.
Our proposed encoder architecture presents several key differences from the approach outlined in [
28]—see
Figure 3. Firstly, the low-level band information is excluded, as our network is specifically designed to identify inconsistencies within higher components of the wavelet subband decomposition. This focus allows the model to concentrate on the high-frequency components that are more likely to reveal subtle manipulations or discrepancies. For each individual channel, given that
n levels of decomposition yield a total of 6 ×
n channels per input channel, an EfficientNetV2-like encoder architecture in [
31] is employed. The choice of EfficientNetV2 is driven by its efficient scaling and superior performance in extracting meaningful features while maintaining computational efficiency. This architecture’s ability to balance depth, width, and resolution ensures that the encoded features are both rich and computationally feasible. The encoder’s design is crucial because it transforms the input into a hierarchy of feature maps, capturing fine-to-coarse details from each channel. This is critical for detecting subtle manipulations like inpainting, where local inconsistencies may vary across different orientations. The multi-encoder setup allows the model to effectively capture a broad range of structural and textural information, as each encoder specializes in a different aspect of the wavelet coefficients. This diversity of features is essential for identifying alterations that standard CNN architectures may overlook.
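The construction of the encoder input from the high-frequency DTCWT subbands can be sketched as follows, assuming the pytorch_wavelets and timm packages and an EfficientNetV2-S backbone as a stand-in for the encoder of [31]. For brevity, a single feature extractor is applied to the stacked magnitude channels, whereas the proposed model uses one encoder per channel; the wavelet filter choices and level count are illustrative assumptions.

```python
import torch
import timm
from pytorch_wavelets import DTCWTForward

x = torch.randn(1, 3, 256, 256)                  # RGB input image (batch of 1)

# Two-level DTCWT; the low-pass band yl is discarded, only the 6 oriented
# high-frequency subbands per level are kept (6 * n channels per input channel).
xfm = DTCWTForward(J=2, biort='near_sym_b', qshift='qshift_b')
yl, yh = xfm(x)                                  # yh[j]: (N, C, 6, Hj, Wj, 2)

# Magnitude of the complex coefficients at the finest level, flattened to channels.
mag = torch.sqrt(yh[0][..., 0] ** 2 + yh[0][..., 1] ** 2)   # (N, C, 6, H1, W1)
enc_in = mag.flatten(1, 2)                       # (N, C*6, H1, W1) = 18 channels here

# An EfficientNetV2-like encoder; features_only exposes the multi-scale
# feature maps later consumed by the UNet++-style decoder.
encoder = timm.create_model('tf_efficientnetv2_s', pretrained=False,
                            features_only=True, in_chans=enc_in.shape[1])
features = encoder(enc_in)
print([f.shape for f in features])
```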
Subsequently, the outputs from each encoded channel are integrated into a decoder, structured similarly to a UNet++ architecture [
32]. The decoding process within the model leverages the U-Net++ architecture, which employs densely connected skip pathways that connect intermediate encoder outputs to corresponding stages within the decoder. This design significantly enhances the flow of information and facilitates feature reuse across different levels of the network, thereby improving the model’s ability to capture both local and global contexts. The U-Net++ decoder comprises nested decoding blocks, beginning with the deepest feature maps, characterized by low spatial resolution and high semantic content. Each block in the decoder progressively up-samples the feature maps while integrating information from the corresponding encoder stages via skip connections. This integration of multi-scale features allows the network to refine its segmentation outputs, enabling it to accurately delineate inpainted regions. The nested structure of the U-Net++ decoder provides a robust mechanism for capturing fine details and contextual information simultaneously, which is particularly crucial for distinguishing natural textures from artificially inpainted areas. By combining dense connectivity with multi-scale feature extraction, the decoder effectively reconstructs a detailed understanding of the image, facilitating the detection of subtle alterations. This architectural approach ensures that both high-level semantic features and low-level spatial features contribute to the final segmentation output, enhancing the model’s sensitivity to inconsistencies indicative of inpainting.
The final stage of the model involves a segmentation head that processes the outputs from the U-Net++ decoder. This component consists of a convolutional layer followed by a sigmoid activation function, which transforms the multi-channel decoder output into a single-channel segmentation mask. The sigmoid activation constrains the pixel values to a range between 0 and 1, representing the probability of each pixel being part of an inpainted region. This probabilistic approach to segmentation allows the model to express varying degrees of confidence regarding the likelihood of inpainting at each pixel location. The generated segmentation mask serves as a direct visual representation of the model’s predictions, highlighting areas that are deemed to be altered. This output is crucial for inpainting detection, as it provides a pixel-level identification of manipulated regions, allowing for a detailed and interpretable assessment of the image’s authenticity. The ability to produce precise segmentation masks is instrumental in forensic applications, where identifying even subtle inpainting artifacts can be critical.
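A minimal PyTorch sketch of such a segmentation head; the input channel count is an illustrative assumption rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Maps the multi-channel decoder output to a per-pixel inpainting probability."""
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, decoder_features: torch.Tensor) -> torch.Tensor:
        logits = self.conv(decoder_features)
        return torch.sigmoid(logits)           # values in [0, 1]: probability of inpainting

head = SegmentationHead(in_channels=32)
mask = head(torch.randn(1, 32, 256, 256))      # (1, 1, 256, 256) probabilistic mask
```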
The model’s output is a probabilistic segmentation mask that identifies areas within the image suspected of being inpainted. Each pixel value in the mask represents the model’s confidence in whether that specific pixel has been artificially altered. By integrating multi-scale feature extraction with dense connections throughout the decoding process, the model can detect nuanced artifacts commonly introduced by inpainting techniques, such as unnatural transitions, blurred edges, or repetitive patterns that deviate from the original texture of the image. Incorporating DTCWT coefficients enhances the model’s ability to detect structural anomalies, providing it with a unique sensitivity to frequency-domain features often indicative of inpainting. This approach allows the model to effectively distinguish between naturally occurring textures and those that have been artificially manipulated, offering a sophisticated and reliable tool for detecting image alterations. By leveraging advanced convolutional networks combined with wavelet-based feature extraction, the model demonstrates a high degree of accuracy in identifying inpainted regions, underscoring its potential applicability in forensic and image authentication contexts.
3.2. Adaptive Noise-Aware Texture Inconsistencies
To improve the accuracy of the network architecture described above, an extra module is added. Our proposed method for inpainting detection assumes that when inpainting is applied to an image, the removed object is typically enclosed within a larger area of similar texture (see
Figure 4). This assumption remains valid even when the object is located at the boundary between two distinct textured regions, as these regions can be analyzed independently (see
Figure 5). The method also relies on the observation that inpainted regions do not possess the same high-level features as the surrounding areas with similar textures.
The proposed module, called adaptive noise-aware texture inconsistencies, relies on Hierarchical Feature Selection [
33] and the DTCWT combined with noise-level estimation. Our paper introduces an advanced methodology for detecting inconsistencies in segmented color regions of an image. The process begins with Hierarchical Feature Selection for color segmentation, ensuring precise identification of regions based on color and texture characteristics, as can be noticed in
Figure 6. For each segmented color region, the module extracts the Dual-Tree Complex Wavelet Transform (DTCWT) coefficients, focusing on a one-level decomposition that results in six complex bands. These coefficients are specifically extracted from the positions indicated by the segmented color regions (see
Figure 6; the first row shows the original image and the mask of the object to be removed, and the second row shows the inpainted image and the texture segmentation obtained for the inpainted image).
Following the segmentation, the module divides each region into smaller patches, where noise estimation is performed. The noise estimation technique is based on the method outlined in [
34], which effectively estimates noise levels by analyzing image gradients, demonstrating both accuracy and efficiency across diverse image types. This method is particularly beneficial for tasks necessitating precise noise quantification. In our approach, we extend this concept by applying noise estimation to individual patches within color-segmented regions, rather than the entire image. This localized analysis allows us to detect discrepancies in noise levels within each segment. Specifically, by calculating noise levels for each patch and identifying regions with high noise variance, we flag these as suspect areas (see the output of applying the DTCWT to the inpainted image in
Figure 7). In
Figure 8, the first image represents the output of texture segmentation, the second image represents the first segment/area to be analyzed (our proposed method analyzes all textures above a given size threshold), and the last image represents the patches together with their noise estimation from the DTCWT. This targeted method enables more precise detection of noise inconsistencies, often indicative of image forgery or tampering. A more zoomed-in version of the results can be seen in
Figure 9.
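The exact gradient-based estimator of [34] is not reproduced here; as an illustrative stand-in, the sketch below uses Immerkær's fast Laplacian-based noise estimator applied per patch, which follows the same idea of high-pass filtering for noise quantification. The grid size and z-score threshold are assumed values for demonstration.

```python
import numpy as np
from scipy.ndimage import convolve

# Immerkaer-style mask used as an illustrative stand-in for the estimator of [34].
LAPLACIAN_MASK = np.array([[1, -2, 1],
                           [-2, 4, -2],
                           [1, -2, 1]], dtype=np.float64)

def estimate_noise_sigma(patch: np.ndarray) -> float:
    """Estimate the noise standard deviation of a grayscale patch."""
    h, w = patch.shape
    response = convolve(patch.astype(np.float64), LAPLACIAN_MASK)
    return np.sqrt(np.pi / 2.0) * np.abs(response).sum() / (6.0 * (w - 2) * (h - 2))

def flag_noisy_patches(region: np.ndarray, grid: int = 8, z_thresh: float = 2.0):
    """Split a grayscale region crop into grid x grid patches and flag noise outliers."""
    h, w = region.shape
    ph, pw = h // grid, w // grid
    sigmas = np.array([[estimate_noise_sigma(region[i*ph:(i+1)*ph, j*pw:(j+1)*pw])
                        for j in range(grid)] for i in range(grid)])
    z = (sigmas - sigmas.mean()) / (sigmas.std() + 1e-8)
    return np.abs(z) > z_thresh                 # boolean map of suspicious patches
```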
The last step in the module’s process involves comparing the suspected regions with a potential mask generated from a neural network. If the areas of high noise variance overlap with the mask, these regions are conclusively marked as forged, ensuring a robust detection of tampered areas. Conversely, regions without noise inconsistency are disregarded, thereby refining the final mask to only areas with confirmed inconsistencies. This method enhances the reliability of color segmentation and noise detection in complex image analysis tasks.
By combining Hierarchical Feature Selection (HFS) for initial color segmentation with the Dual-Tree Complex Wavelet Transform (DTCWT) and noise estimation per region, the approach captures inconsistencies accurately. The reasons why this design is effective are the following:
Texture/color segmentation: Precise segmentation is essential because inconsistencies and noise patterns are better detected when analysis is confined to homogenous regions rather than applied to the whole image. The focus on texture and color provides a stable basis for noise estimation in each region, leading to a reliable analysis of inconsistencies.
Complex wavelets decomposition: Because DTCWT provides both real and imaginary components, it can better represent subtle textures and structural details in each region. Analyzing these coefficients allows the model to capture local inconsistencies that may not be evident through raw pixel values. This is especially relevant in detecting small-scale texture discrepancies that signify inconsistencies.
Noise estimation per region in wavelet domain: Noise patterns vary across regions, especially in images with diverse textures and colors. Estimating noise locally for each patch (rather than globally) accommodates these variations, making it easier to detect unnatural inconsistencies in textures and patterns. By analyzing wavelet coefficients rather than raw pixels, the noise estimation becomes sensitive to structural anomalies that could indicate potential inconsistencies.
To summarize, the core idea of the proposed method is that, within a given texture, wavelet coefficients are expected to follow a certain statistical distribution, often close to a normal distribution, particularly in high-level, noise-free regions. Deviations from this expected distribution suggest potential anomalies or inconsistencies. This expectation aligns with the general assumption that, in a coherent texture, the response of the wavelet coefficients should remain relatively consistent (following a near-normal distribution due to the central limit theorem in the statistical analysis of natural images). When the coefficients deviate significantly from this expected distribution, it flags possible inconsistencies—suggesting either noise or texture anomalies that deviate from the original structure. This approach ensures that inconsistencies are not merely variations in pixel color but statistically significant deviations within the texture’s structure. While the method appears robust, there could be challenges in handling images that have regions with overlapping textures or very subtle inconsistencies that do not significantly affect the wavelet coefficient distributions.
4. Results
4.1. Real Inpainting Detection Dataset
Building on our previous research in [
3], it is evident that existing inpainting detection datasets are insufficient for real-world applications. These datasets often lack critical attributes necessary for accurately identifying inpainted regions (e.g., removed objects). Even in recent works like [
18,
19], the proposed datasets suffer from inconsistencies where the inpainted regions do not align well with the original image content. To improve the effectiveness of the inpainting methods, we propose focusing on clearly defined objects or regions. For this reason, we employ the Google Open Images V7 dataset [
35], and for the segmented objects we rely on [
36].
To evaluate the effectiveness of our proposed method, we applied several inpainting techniques. Specifically, for each image in the Google Open Images dataset, we selected a segmented object and applied three different inpainting methods. This process resulted in a total of 12,000 forged images from an initial input of 4000 images. All original images, masks, and inpainted images are uploaded to
https://github.com/jmaba/Deep-dual-tree-complex-neural-network-for-image-inpainting-detection (accessed on 10 November 2024). The inpainting methods are drawn from different approaches, such as the Fourier-based method in [
37] or newer methods based on transformers like [
38,
39].
4.2. Experimental Setup
During the training process, 6000 of the images were randomly selected from each inpainting method. A separate set of 1500 images was used for validation, while the rest were used for testing. To ensure a robust evaluation of the model, we implemented a k-fold cross-validation strategy, allowing us to assess the model’s performance across various data subsets. The model was trained for 30 epochs, providing enough iterations for optimization convergence while minimizing the risk of overfitting. This method allowed for a thorough assessment of the model’s ability to generalize to unseen data. The AdamW optimization algorithm was employed with a learning rate of 1 × 10^-5, while the weight decay was set to 1 × 10^-4.
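For reproducibility, the reported optimizer settings correspond to the following PyTorch configuration. The model, data loader, and loss function shown here are placeholders and assumptions (a binary cross-entropy objective is assumed for the mask output), not the exact training code of the paper.

```python
import torch
from torch.optim import AdamW

# Placeholders for the scatternet/UNet++ model and the Real Inpainting
# Detection Dataset loader described in this paper.
model = torch.nn.Conv2d(3, 1, kernel_size=1)           # placeholder network (outputs logits)
train_loader = []                                       # placeholder iterable of (image, mask)

optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
criterion = torch.nn.BCEWithLogitsLoss()                # assumption: binary mask loss

for epoch in range(30):                                 # 30 epochs, as reported
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```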
4.3. Implementation Details
4.3.1. Enhanced Wavelet Scattering Network Module
For the wavelet scattering network, two methods have been employed: the first option is the standard Morlet wavelet scattering proposed by Mallat and improved in [
40], and the second option is the Dual-Tree complex wavelet scattering in [
28]. For the first option, the transform used in this study applies two successive wavelet transforms, each followed by a modulus non-linearity, utilizing eight different angles for the wavelet transform. For the latter, we use a second-order wavelet scattering network. To determine which wavelet scattering yields the best results, the same training and validation procedure was run for 20 epochs. From the table below, it can be observed that the Morlet-based approach indeed gives better results, but not by a significant margin compared to the Cotter method, and in terms of training/testing time the Cotter approach is ten times faster. Based on these results, the scattering method used is the one proposed by Cotter. For the EfficientNetV2 model architecture, the EfficientNetV2-S encoder is applied to each wavelet scattering channel band. The decoder utilizes an architecture similar to U-Net++ for each individual band, which is then followed by a segmentation head. A unique feature of the proposed model is the incorporation of global average pooling to extract global contextual information from the decoder output. This global context is processed through a fully connected layer to produce a feature that summarizes the entire image’s content. The global feature is then merged with the high-resolution output from the decoder via a custom convolutional layer. This fusion of local and global information enhances the model’s ability to detect inpainting across different scales.
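A minimal sketch of the global-context fusion described above; the channel sizes and layer shapes are illustrative assumptions, not the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    """Fuse a globally pooled summary of the decoder output back into the
    high-resolution feature map before the segmentation head (sketch)."""
    def __init__(self, channels: int = 32, global_dim: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Linear(channels, global_dim)      # global image descriptor
        self.fuse = nn.Conv2d(channels + global_dim, channels, kernel_size=3, padding=1)

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        n, c, h, w = decoder_out.shape
        g = self.fc(self.pool(decoder_out).flatten(1))          # (N, global_dim)
        g = g[:, :, None, None].expand(n, -1, h, w)             # broadcast over space
        return self.fuse(torch.cat([decoder_out, g], dim=1))    # local + global fusion

fused = GlobalContextFusion()(torch.randn(1, 32, 128, 128))
```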
4.3.2. Adaptive Noise-Aware Texture Module
The segmentation result from the network is passed into a noise-aware texture module. Initially, color segmentation is performed on the input image using the method from [
33] with the SLIC parameter set to 32. Segmented regions smaller than 10% of the original image size are discarded. The segmented image is then divided into patches on an eight-by-eight grid. For the noise variance analysis, the original six subbands (magnitude values) are used. If the standard deviation between patches exceeds a certain threshold, those patches are flagged as suspicious. Only areas where suspicious regions overlap with the segmentation results from the previous network module are marked as forged.
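A rough sketch of the DTCWT-domain variance analysis with the parameters listed above (eight-by-eight grid, six magnitude subbands, overlap with the network mask). The HFS/SLIC segmentation step is assumed to be performed elsewhere, the deviation factor k is an illustrative assumption, and the open-source dtcwt package stands in for the paper's implementation.

```python
import numpy as np
import dtcwt

def flag_inconsistent_patches(image_gray: np.ndarray, grid: int = 8, k: float = 1.5):
    """Per-patch spread of the six DTCWT magnitude subbands; patches whose spread
    deviates strongly from the image-wide statistics are flagged as suspicious."""
    pyramid = dtcwt.Transform2d().forward(image_gray.astype(np.float64), nlevels=1)
    mags = np.abs(pyramid.highpasses[0])                  # (H/2, W/2, 6) magnitudes
    h, w, _ = mags.shape
    ph, pw = h // grid, w // grid
    spread = np.array([[mags[i*ph:(i+1)*ph, j*pw:(j+1)*pw, :].std()
                        for j in range(grid)] for i in range(grid)])
    return np.abs(spread - spread.mean()) > k * spread.std()   # boolean grid x grid map

def fuse_with_network_mask(flags: np.ndarray, nn_mask: np.ndarray) -> np.ndarray:
    """Keep only suspicious patches that overlap the mask proposed by the network."""
    grid = flags.shape[0]
    h, w = nn_mask.shape
    ph, pw = h // grid, w // grid
    out = np.zeros_like(nn_mask, dtype=bool)
    for i in range(grid):
        for j in range(grid):
            if flags[i, j] and nn_mask[i*ph:(i+1)*ph, j*pw:(j+1)*pw].any():
                out[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = True
    return out
```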
4.4. State-of-the-Art Analysis Comparison
As in previous works, we evaluate performance using IoU, F1, Precision, Recall, and Accuracy at the pixel level. Results were reported using the default threshold of 0.5. For image-level analysis, we focus on balanced accuracy, which considers both false positives and false negatives, with the threshold set to 0.5. The proposed method was evaluated using an extended version of the publicly available inpainting forgery detection dataset. The results demonstrate the effectiveness of the proposed method in detecting tampered regions across a wider variety of image content and inpainting techniques. For demonstration purposes, we selected several images, including the originals, masks, and inpainting outcomes. We applied the detection methods proposed in [
18], referred to as IID; Ref. [
41], referred to as PSCCNET; Ref. [
42], referred to as FOCAL; and [
43], referred to as TruFor. For all models, the available pretrained networks were used. For [
19] (a variant of [
18]), we tried training it on our dataset, but it did not yield reliable results during training. We hypothesize that this issue arises because the network requires input images of a fixed size (256 × 256). Consequently, the resizing operation at the beginning of the training process may distort critical features in the images, preventing the network from effectively learning. In
Figure 10, the original image, the mask of the removed object, and the inpainted image are presented. The inpainted image is the input to the detection methods. In
Figure 11, the results of inpainting detection are presented, comparing our method with state-of-the-art methods.
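For clarity, the pixel-level metrics at the 0.5 threshold used throughout this section can be computed as follows; this is a generic sketch, independent of any particular detector.

```python
import numpy as np

def pixel_metrics(pred_prob: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5):
    """IoU, F1, precision, recall and accuracy between a predicted probability map
    and a binary ground-truth mask, evaluated at the given threshold."""
    pred = pred_prob >= threshold
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return {
        "IoU": tp / (tp + fp + fn + 1e-8),
        "F1": 2 * precision * recall / (precision + recall + 1e-8),
        "Precision": precision,
        "Recall": recall,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```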
In
Figure 12, the metrics for each detection method on each dataset are presented. The evaluation of various methods (FOCAL, IID, PSCCNET, TruFor) against our proposed method in the task of image inpainting detection highlights the superiority of our approach across key metrics: Accuracy, IoU (Intersection over Union), and Precision. Our proposed method consistently demonstrates the highest performance across all metrics, indicating its robustness in accurately identifying inpainted regions. Compared to other methods, it shows a marked improvement in Accuracy, reflecting superior reliability in detection. The significantly higher IoU values highlight its exceptional ability to precisely localize inpainted areas, which is crucial for detecting subtle modifications in images. Additionally, the outstanding Precision of our method—far exceeding that of FOCAL, IID, and PSCCNET—underscores its effectiveness in minimizing false positives, a critical factor for practical application. Overall, our proposed method outperforms existing approaches, establishing itself as a leading solution for image inpainting detection. Its consistent superiority across Accuracy, IoU, and Precision metrics suggests that it provides more reliable, precise, and confident detection of inpainting artifacts. These results position our method as highly effective for applications in image forensics, content verification, and quality assessment, reinforcing its potential as a state-of-the-art tool in the field.
4.5. Ablation Study
To evaluate the effectiveness of the proposed method, we focused on training only on images from one dataset and testing on the others. As shown in
Table 1, the IoU was suboptimal, indicating that training exclusively on the LaMa dataset does not yield satisfactory results. This finding suggests that the network may need to be specifically trained for each category of image inpainting method. Additionally, it highlights that even in the absence of visual cues in the Dual-Tree Complex Wavelet coefficients, the network still learns relevant information.
The IoU is significantly higher when the model is trained on all datasets (0.82) compared to when it is trained on LaMa alone and tested on other datasets (0.20 for MAT and 0.12 for ZITS). This stark reduction highlights the importance of diverse training data in learning a more generalized feature set that can effectively capture inpainted regions across different test sets. Precision drops sharply when the model is trained on LaMa and tested on other datasets, with values of 0.22 for MAT and 0.16 for ZITS, compared to 0.93 when trained on all datasets. This indicates that the method struggles with false positives when trained on limited data, underscoring the poor transferability of features learned solely from the LaMa dataset. Accuracy also drops sharply when the model is trained on LaMa and tested on other datasets, with values of 0.42 for MAT and 0.41 for ZITS, compared to 0.95 when trained on all datasets. This further confirms the poor transferability of the features learned solely from the LaMa dataset. The significant decline in performance metrics when the model is trained only on the LaMa dataset and tested on MAT or ZITS highlights the model’s overfitting to the specific characteristics of the LaMa dataset. The lack of diverse training data restricts the model’s ability to generalize, making it vulnerable when exposed to unseen test data with different textures, artifacts, or inpainting patterns. The results emphasize the critical need for diverse and comprehensive training data to ensure the proposed method’s robustness and effectiveness across various test conditions. Training on a wider range of datasets enables the model to capture a broader spectrum of inpainting features, resulting in significantly improved IoU, Precision, and Accuracy. These findings suggest that future work should prioritize the inclusion of varied inpainting patterns and artifacts in the training phase to enhance the model’s generalizability and performance across different application scenarios.
4.6. Post-Processing Impact of Forgery Detection
Building on the results presented above, the following subsection delves deeper into the capabilities of the proposed method in detecting alterations. We focused on two common operations—image resizing and blurring. Given that the input dataset comprises images of varying sizes, we evaluated detection performance using proportional resizing. Additionally, we evaluated the impact of blurring the input image separately, and finally, we combined resizing and blurring into a single operation. The first post-processing operation analyzed is resizing (see
Figure 13 and
Figure 14). The images are resized to 0.7 of their initial size. As can be noticed, the proposed method does not yield good results (see
Table 2).
The IoU drops dramatically from 0.82 to 0.13 when the images are resized, indicating that the resized images severely impair the method’s ability to correctly overlap detected inpainted regions with the true regions. This significant reduction suggests that the method struggles to maintain consistent detection performance when subjected to size alterations. Precision decreases from 0.93 to 0.76 after resizing, showing that the method becomes less reliable and introduces more false positives when identifying inpainted areas. Although precision remains relatively high, the drop indicates that resizing introduces noise or artifacts that affect detection quality. Accuracy also suffers, decreasing from 0.95 to 0.78. This decline reflects the method’s overall reduced effectiveness in distinguishing inpainted from non-inpainted regions under resizing conditions. The primary cause of this degradation is the sensitivity of the Dual-Tree Complex Wavelet Transform (DTCWT) coefficients to resizing operations. DTCWT coefficients are critical features used by the proposed method for detecting inpainted regions. However, resizing alters the spatial frequency and orientation of these coefficients, leading to misalignments and incorrect feature extraction, ultimately impairing the detection capability. The results demonstrate that the proposed method’s performance is highly susceptible to resizing operations, particularly due to the disruption of DTCWT coefficients. This highlights a significant limitation when applying the method to images that undergo resizing, emphasizing the need for robust adaptation or alternative feature extraction strategies to handle such transformations effectively. Future work should focus on enhancing the resilience of the detection algorithm to maintain performance across various post-processing operations, including resizing.
The second investigated post-processing operation is blurring. For this operation, we took the original images and applied a box blur with a radius of 5. The visual results can be seen in
Figure 15, while the overall results can be seen in
Table 3.
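The two post-processing operations evaluated here (proportional resizing to 70% of the original size and box blurring with a radius of 5) can be reproduced with Pillow; the synthetic input image below is a placeholder for an inpainted test image.

```python
from PIL import Image, ImageFilter

# Placeholder image; in practice, an inpainted test image would be loaded here.
img = Image.new("RGB", (640, 480), color=(128, 128, 128))

# Proportional resize to 70% of the original dimensions.
resized = img.resize((int(img.width * 0.7), int(img.height * 0.7)), Image.BICUBIC)

# Box blur with a radius of 5 pixels.
blurred = img.filter(ImageFilter.BoxBlur(5))

# Combined operation: resize followed by blur (one possible combination).
combined = resized.filter(ImageFilter.BoxBlur(5))
```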
The IoU decreases from 0.82 to 0.33 when blurring is applied, indicating a moderate reduction in the method’s ability to correctly identify the overlapping areas between the detected and actual inpainted regions. Although the IoU is lower for blurred images, the method retains some effectiveness in localizing inpainted regions despite the added noise from blurring. Precision remains relatively stable, with a slight decrease from 0.93 to 0.92. This suggests that blurring does not significantly impact the method’s ability to maintain a low rate of false positives. The near-constant precision indicates that the method is still confident in its positive detections, even when blurring is applied. The accuracy drops from 0.95 to 0.85, showing that blurring reduces the method’s overall effectiveness in correctly distinguishing inpainted regions from non-inpainted areas. The decreased accuracy points to challenges in correctly classifying the altered pixel values introduced by the blurring effect. Blurring, particularly box blurring, smooths the image by averaging pixel values within a given radius, which disrupts the edge and texture information critical for inpainting detection. The proposed method relies on detailed feature extraction, and blurring diminishes the distinctiveness of inpainted regions, making them harder to detect accurately. While the proposed method’s performance degrades under blurring, the impact is less severe compared to resizing. The IoU and accuracy decline, indicating challenges in detecting precise inpainted boundaries and maintaining overall detection accuracy. However, the high precision suggests that the method remains effective at minimizing false positives despite blurring. These results highlight the method’s partial robustness to blurring but also underscore the need for enhancement strategies to mitigate the effects of such image transformations. Future research should explore adaptive filtering techniques or more resilient feature extraction methods that can better withstand blurring without compromising detection quality.
4.7. Analysis on Other Datasets
Based on the results obtained above, the next step was to assess the robustness of the method on other datasets. The first dataset to be analyzed is the one proposed in [
18]. The dataset consists of two parts—one used for training/validation (based on [
44,
45]) and one used for testing. The dataset consists only of small images (256 × 256). Also, the masks were generated randomly, thus forcing the inpainting methods to generate a lot of artifacts. Drawing on the ablation study, and to ascertain the robustness of the proposed method, the neural network had to be retrained. Thus, the same steps as those performed in [
18] were followed: training on the Places/Dresden subset and testing on the other subset of the dataset. The network was trained for 20 epochs. Additionally, due to the small size (and low quality) of the images, the adaptive noise-aware texture module was not applied. The overall results can be seen in
Table 4, while some test image results can be seen in
Figure 16 and
Figure 17. Although the method appears to learn and generalize the features in the training/validation phase, it is not able to detect inpainted areas on the testing dataset. Despite these general performance differences, there are certain test images where the model achieves over 90% across all metrics, suggesting that it can clearly distinguish some patterns effectively. This indicates that the model may be proficient at recognizing specific, distinct features in these cases, even though it struggles to generalize broadly across the dataset, or it needs more time (epochs) to learn all types of patterns.
The next analyzed dataset is the one proposed in [
46]. In this dataset, the authors took an approach similar to ours: they took all images from the COCO dataset [
47] along with the segmented masks for each object inside each image and then applied the inpainting method described in [
48]. While their approach aligns with ours in certain ways, it lacks the authenticity of selecting objects for removal based on the surrounding context of the area, as can be noticed in
Figure 18. Since the inpainting method used here differs from those in our own approach, retraining was necessary. The model shows encouraging progress over the first 20 training epochs, with consistent improvement in key metrics like IoU, precision, and F-score on the training set – as can be observed in
Table 5. By the end of this phase, the training IoU nearly doubles, and precision steadily improves, indicating the model’s growing ability to accurately capture target regions. Recall remains high from the start, suggesting that the model is effective at identifying relevant features across images. In the validation phase, however, results are more variable. While there is some improvement over time, validation IoU and precision fluctuate, hinting at challenges in generalizing to new data. This inconsistency, compared to the training performance, suggests possible overfitting, where the model does well on known data but struggles to adapt to unseen images. In summary, the model shows promising learning on the training set but could benefit from targeted adjustments to improve reliability across diverse datasets, supporting a more balanced and adaptable performance. Some results can be noticed in
Figure 18 and
Figure 19.