Article

Symmetric Connected U-Net with Multi-Head Self Attention (MHSA) and WGAN for Image Inpainting

1 School of Information Engineering, Zhengzhou University of Industrial Technology, Zhengzhou 451100, China
2 School of Information Engineering, Zhengzhou University of Technology, Zhengzhou 450044, China
3 School of Software Engineering, Henan University of Engineering, Zhengzhou 451191, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(11), 1423; https://doi.org/10.3390/sym16111423
Submission received: 27 September 2024 / Revised: 17 October 2024 / Accepted: 22 October 2024 / Published: 25 October 2024
(This article belongs to the Section Computer)

Abstract

This study presents a new image inpainting model based on U-Net and incorporating the Wasserstein Generative Adversarial Network (WGAN). The model uses skip connections to connect every encoder block to the corresponding decoder block, resulting in a strictly symmetrical architecture referred to as Symmetric Connected U-Net (SC-Unet). By combining SC-Unet with a GAN, the study aims to reconstruct images more effectively and seamlessly. Traditional discriminators only classify the entire image as real or fake. In this study, the discriminator estimates the probability of each pixel belonging to the hole and non-hole regions, which provides the generator with more gradient loss information for image inpainting. Additionally, every block of SC-Unet incorporates a Dilated Convolutional Neural Network (DCNN) to increase the receptive field of the convolutional layers. Our model also integrates Multi-Head Self-Attention (MHSA) into selected blocks to enable it to efficiently search the entire image for suitable content to fill the missing areas. This study adopts the publicly available datasets CelebA-HQ and ImageNet for evaluation. In our experiments, the proposed algorithm demonstrates a 10% improvement in PSNR and a 2.94% improvement in SSIM compared to existing representative image inpainting methods.

1. Introduction

Image inpainting is a technique used to fill in missing areas of an image using information from the known areas. This known information may include structural, statistical, and semantic information [1]. Traditional image inpainting methods fall into two categories: diffusion-based and patch-based. Diffusion-based methods propagate background data into the missing areas using a diffusive process typically modeled with differential operators [2]. Patch-based methods, on the other hand, fill in the missing regions with patches from a collection of source images that maximize patch similarity [3]. However, traditional image inpainting methods struggle to achieve semantic restoration while balancing the effects of structural and texture inpainting [4]. Recently, deep learning has become a popular research area in image inpainting. Deep learning methods have addressed the deficiencies of traditional inpainting methods and significantly improved the quality of generated results. Deep learning image inpainting methods are classified according to their network architecture into Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) [5]. Mardieva et al. [6] trained a three-layer deep CNN to establish an end-to-end mapping between low- and high-resolution images, enhancing the ability of the model to process and distill features effectively. U-Net [7] is one of the most widely used deep learning methods for image inpainting, based on CNNs and the encoder–decoder model. U-Net contains three main parts: an encoder, a decoder, and skip connections. The encoder extracts the feature map from the original image, and the decoder recovers the corrupted image from the feature maps transferred by the skip connections. Another widely used deep learning method is the Generative Adversarial Network (GAN), proposed by Goodfellow et al. [8] in 2014, which has significantly improved the quality of generated results in image inpainting. Therefore, combining U-Net with a GAN to generate missing image content can produce impressive results.
This study enhances the U-Net architecture by adding convolutional blocks to both the encoder and decoder. Every encoder block is connected to the corresponding decoder block through skip connections. Because of its strictly symmetric architecture, this new model is called ‘Symmetric Connected U-Net’ (SC-Unet). The main framework of this study combines SC-Unet with a GAN for image inpainting. He et al. [9] demonstrated that skip connections provide an alternative path for the backpropagation of gradients across convolutional layers. Skip connections are beneficial for model convergence and allow the depth of the network to be significantly extended. Additionally, skip connections enhance the backward propagation of gradients between the encoder and the decoder, boost the learning ability of the generator, and stabilize the convergence of the reconstruction loss [10].
The receptive field is one of the basic concepts in CNNs. It is defined as a measure of association between an output feature (of any layer) and the input region of the network. A larger receptive field provides more context and segmentation information [11]. Consequently, the receptive field plays a vital role in image inpainting models. However, a limited number of layers in a CNN and convolution operators with narrow kernel sizes restrict the receptive field of the CNN [12]. To increase the receptive field of the convolutional layers, the Dilated Convolutional Neural Network (DCNN) has been proposed. Dilated convolution expands the convolutional kernel by inserting holes between its consecutive elements, which enables the network to have a larger receptive field without increasing the number of parameters [13]. Additionally, skip connections can enlarge the receptive fields to enhance network representation [10].
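To make the effect of dilation concrete, the minimal PyTorch sketch below compares a standard 3 × 3 convolution with a dilated one; the channel count and dilation rate are illustrative choices, not the exact configuration used in SC-Unet.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a dummy feature map (N, C, H, W)

# A standard 3x3 convolution covers a 3x3 input neighbourhood per output pixel.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# With dilation=2 the same 3x3 kernel (9 weights, identical parameter count)
# is spread over a 5x5 neighbourhood, enlarging the receptive field.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)            # both keep the 32x32 resolution
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))   # same number of parameters
```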
We have incorporated DCNN and MHSA functions into SC-Unet. Our model was tested on the publicly available datasets CelebA-HQ and ImageNet with a size of 256 × 256. We performed an ablation study to validate the effectiveness of the DCNN, MHSA, SC-Unet, generator, and discriminator modules. The experimental results demonstrated that SC-Unet assists the generator in reconstructing images more semantically and seamlessly, helps the discriminator capture more gradient loss information, and achieves satisfactory results.

2. Related Work

With the rapid progress in deep convolutional neural networks, Ronneberger et al. [7] first proposed U-Net for medical image segmentation, and U-Net is now widely used in semantic segmentation and image inpainting. Yan et al. [14] proposed a particular shift-connection layer for the U-Net architecture, named Shift-Net, for filling in missing regions of any shape with sharp structures and fine-detailed textures, enhancing the explicit relation between the encoded feature in the known region and the decoded feature in the missing region. Yu et al. [15] proposed a coarse-to-fine generative image inpainting framework and added a contextual attention module to it, which extracts feature information from known background patches to generate the missing patches. Liu et al. [16] proposed a multi-stage progressive reasoning network for mural inpainting containing global-to-local receptive fields, enriching the information of disappeared regions and giving contextually explicit embedding results. Liu et al. [17], basing their work on the U-Net architecture, proposed a partial convolution layer with an automatic mask updating mechanism that achieves satisfactory image inpainting results.
The Generative Adversarial Network (GAN) architecture is known for its powerful data generation capabilities, which are well suited to image inpainting tasks [18]. A GAN comprises a generator producing fake samples and a discriminator distinguishing between fake and real samples. The Context Encoder proposed by Pathak et al. [18] was the first image inpainting network to combine a CNN and a GAN. Im et al. [19] also implemented a GAN-based method for image inpainting. Iizuka et al. [20] modified the generation network structure and added a global discriminator to enhance image inpainting results. Despite their advantages, GANs face challenges such as unstable training, model collapse, and gradient disappearance. To address these issues, scholars have proposed DCGAN [21], WGAN [22], CGAN [23], PatchGAN [24], and other variants to improve the stability and the performance of generators and discriminators. Arjovsky et al.’s use of the Wasserstein distance to train a GAN theoretically solved the instability problem in GAN training [22]. Experiments have demonstrated that the Wasserstein GAN (WGAN) effectively addresses GAN training difficulty and model collapse. Image inpainting methods based on WGAN are easier to train and more stable, leading to better results than the traditional GAN [25].
Compared to traditional convolutional neural networks, transformer-based approaches utilize Multi-Head Self-Attention (MHSA) to capture long-range correlations between image regions for image inpainting [26]. Wang et al. [27] proposed a new image inpainting method for large irregular masks; they introduced a multi-stage attention module and then used a partial convolution strategy to repair the image progressively from coarse to fine. Li et al. [28] designed RFR-Net, which mainly consists of a plug-and-play recurrent feature reasoning module and a knowledge-consistent attention (KCA) module, and can effectively repair damaged images with large missing areas. Some researchers have developed a class of attention operators called contextual attention [29]. With the contextual attention module, they can search the entire image for appropriate content to fill the missing regions.

3. Approach

3.1. The Principle of the SC-Unet Model

The SC-Unet architecture, shown in Figure 1, differs from the traditional U-Net by replacing max-pooling with an increased convolution kernel stride for downsampling. Each white box represents a convolutional block, while the blue boxes indicate copied feature channels. Skip connections are denoted by the blue arrows, and the resolution of each feature map is noted next to its box. The yellow arrows represent 3 × 3 convolution layers with a stride of 2, while the black arrows signify upsampling. As Figure 1 shows, skip connections link every encoder block to the corresponding decoder block, resulting in a strictly symmetric architecture. We use LeakyReLU (0.1) as the activation function for every convolutional block.
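As an illustration of the downsampling described above, the following sketch implements one encoder step with a stride-2 3 × 3 convolution and LeakyReLU (0.1); the channel sizes are assumptions for demonstration and do not reproduce the full SC-Unet.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One SC-Unet encoder step as described in the text: a 3x3 convolution
    with stride 2 replaces max-pooling for downsampling, followed by
    LeakyReLU(0.1). Channel sizes here are illustrative."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.conv(x))

x = torch.randn(1, 3, 256, 256)
y = DownBlock(3, 64)(x)
print(y.shape)  # torch.Size([1, 64, 128, 128]) -- resolution halved by the stride
```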

3.2. DCNN in Convolutional Block of SC-Unet

As shown in Figure 2, all blocks of SC-Unet except the last four consist of a CNN (3 × 3) layer and a DCNN (3 × 3) layer, which differs from the original convolutional block of U-Net. The input data first undergo the operation of the DCNN with a 3 × 3 kernel, and the result is then concatenated with the original input data along the channel (last) dimension. Next, the concatenated result goes through a 3 × 3 convolutional layer to produce the final output. By incorporating a DCNN, we enhance the receptive field of the entire neural network. Additionally, the inclusion of the DCNN does not hinder the original input information from flowing into the subsequent convolution, allowing the structure to maintain flexibility. During training, the module can determine whether to utilize the raw input data or the information passed through the DCNN, depending on the specific requirement.
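A minimal sketch of this block structure is given below, assuming a dilation rate of 2 and equal input/output channel counts (both assumptions; the paper does not fix these values here). It follows the described data flow: dilated 3 × 3 convolution, concatenation with the raw input along the channel axis, then a fusing 3 × 3 convolution.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Sketch of the SC-Unet convolutional block with a DCNN branch: the input
    passes through a dilated 3x3 convolution, the result is concatenated with
    the raw input along the channel axis, and a 3x3 convolution fuses both
    into the block output. Dilation rate and channel counts are assumptions."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.dilated = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=dilation, dilation=dilation)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        d = self.act(self.dilated(x))
        # Keep the raw input available so the block can fall back on it.
        cat = torch.cat([x, d], dim=1)
        return self.act(self.fuse(cat))

x = torch.randn(1, 64, 64, 64)
print(DilatedBlock(64)(x).shape)  # torch.Size([1, 64, 64, 64])
```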

3.3. MHSA in Convolutional Block of SC-Unet

The MHSA (Multi-Head Self-Attention) method has high computational complexity. Its computational cost grows quadratically with the spatial resolution, which makes it unsuitable for pixel-level image reconstruction [30]. To address this issue, most existing image inpainting models use MHSA to process low-resolution feature maps while adopting CNNs for high-resolution images [26]. When the resolution of the feature map becomes 4 × 4 in this study, the corresponding convolutional block of SC-Unet gains a Multi-Head Self-Attention layer after the convolutional layer (as shown in Figure 3) to avoid excessive computation.
The MHSA layer processes the features from the 3 × 3 convolutional layer as the Q (Query), K (Key), and V (Value) inputs [31]. Through MHSA, similar features from the surrounding area are incorporated into the hole area, improving the content and structure information while effectively reducing checkerboard artifacts, smoothing the boundaries, and enhancing the details of the reconstructed image [32].
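The following sketch shows one way such an attention-augmented block could look in PyTorch, applying nn.MultiheadAttention to the flattened 4 × 4 feature map with the convolutional features serving as Q, K, and V; the number of heads and the channel count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvMHSABlock(nn.Module):
    """Sketch of the attention-augmented block: a 3x3 convolution produces the
    features that serve as Q, K and V for multi-head self-attention. Applied
    only at low resolution (4x4 here), so the token sequence has length 16.
    The number of heads is an assumption."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.mhsa = nn.MultiheadAttention(embed_dim=channels, num_heads=heads,
                                          batch_first=True)

    def forward(self, x):
        f = self.conv(x)                       # (N, C, H, W)
        n, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)     # (N, H*W, C) token sequence
        attn, _ = self.mhsa(seq, seq, seq)     # Q = K = V = conv features
        return attn.transpose(1, 2).reshape(n, c, h, w)

x = torch.randn(1, 512, 4, 4)                  # low-resolution feature map
print(ConvMHSABlock(512)(x).shape)             # torch.Size([1, 512, 4, 4])
```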

3.4. The Loss Functions

The generator using the SC-Unet model is trained with a combined loss known as the content loss. The generator loss comprises reconstruction loss, perceptual loss, style loss, and total variation (TV) loss. Our goal with these loss functions is to achieve precise per-pixel reconstruction and smooth transitions of predicted hole area pixel values into their surrounding context.
In this study, the original or ground truth image is denoted as $I_{gt}$. The mask image is a binary image, denoted as $M$ (0 for the hole region, 1 for the non-hole region). The input image, also known as the masked image, is denoted as $I_{in} = I_{gt} \odot M$, with the symbol $\odot$ indicating per-pixel multiplication. The reconstructed image produced by the generator is referred to as $I_{out}$.
The reconstruction loss measures the disparity between $I_{out}$ and $I_{gt}$ using the pixel-wise L1 distance. This paper defines $L_{hole}$ as the reconstruction loss on the network output for the hole pixels and $L_{valid}$ as the reconstruction loss for the non-hole pixels. $C$, $H$, and $W$ represent the channels, height, and width of the feature map, respectively. The formulae are defined as follows:
$L_{valid} = \frac{1}{CHW} \left\| M \odot (I_{out} - I_{gt}) \right\|_1$ (1)
$L_{hole} = \frac{1}{CHW} \left\| (1 - M) \odot (I_{out} - I_{gt}) \right\|_1$ (2)
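A direct PyTorch transcription of Equations (1) and (2) might look as follows; the tensor shapes and the broadcasting of a single-channel mask are implementation assumptions.

```python
import torch

def reconstruction_losses(mask, out, gt):
    """L_valid and L_hole as in Equations (1)-(2): per-pixel L1 distances,
    restricted to the non-hole (M = 1) and hole (M = 0) regions and
    normalised by C*H*W. Tensors are (N, C, H, W); the mask broadcasts
    over the channel dimension."""
    chw = out.shape[1] * out.shape[2] * out.shape[3]
    l_valid = torch.abs(mask * (out - gt)).sum(dim=(1, 2, 3)) / chw
    l_hole = torch.abs((1 - mask) * (out - gt)).sum(dim=(1, 2, 3)) / chw
    return l_valid.mean(), l_hole.mean()

gt = torch.rand(2, 3, 256, 256)
mask = (torch.rand(2, 1, 256, 256) > 0.3).float()   # 1 = non-hole, 0 = hole
out = torch.rand(2, 3, 256, 256)
print(reconstruction_losses(mask, out, gt))
```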
To capture structure and texture information during generation learning, we introduce the perceptual loss. The perceptual loss is calculated using the L1 distance in the feature space of the pre-trained VGG-16 network [33] between $I_{out}$ and $I_{gt}$, as well as between $I_{comp}$ and $I_{gt}$. $I_{comp}$ refers to a composite image that combines the hole region from $I_{out}$ and the non-hole region from $I_{gt}$; see Formula (3) for details. The VGG-16 network has been pre-trained on the ImageNet dataset to map these images into higher-level feature spaces [34]. The formulae for $I_{comp}$ and $L_{perceptual}$ are defined as follows [35]:
$I_{comp} = M \odot I_{gt} + (1 - M) \odot I_{out}$ (3)
$L_{perceptual} = \sum_{i=0}^{N-1} \frac{1}{C_i H_i W_i} \left\| \Psi_i(I_{out}) - \Psi_i(I_{gt}) \right\|_1 + \sum_{i=0}^{N-1} \frac{1}{C_i H_i W_i} \left\| \Psi_i(I_{comp}) - \Psi_i(I_{gt}) \right\|_1$ (4)
Here, $\Psi_i(I_{out})$, $\Psi_i(I_{comp})$, and $\Psi_i(I_{gt})$ are the feature maps of the $i$th selected layer of VGG-16 for the inputs $I_{out}$, $I_{comp}$, and $I_{gt}$, respectively. We selected layers pool1, pool2, and pool3 for the perceptual loss calculation [17]. $H_i$, $W_i$, and $C_i$ denote the height, width, and channel size of the $i$th feature map, respectively.
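The sketch below illustrates how the perceptual loss of Equation (4) could be computed with torchvision's VGG-16, slicing the feature extractor at pool1, pool2, and pool3; ImageNet input normalization is omitted for brevity, and the slicing indices assume torchvision's standard vgg16 layout.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGFeatures(nn.Module):
    """Frozen VGG-16 feature extractor returning the pool1, pool2 and pool3
    activations used for the perceptual and style losses. Slice indices
    follow torchvision's vgg16.features layout (pool layers at 4, 9, 16)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        self.slices = nn.ModuleList([feats[:5], feats[5:10], feats[10:17]])
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        outs = []
        for s in self.slices:
            x = s(x)
            outs.append(x)
        return outs

def perceptual_loss(extractor, out, comp, gt):
    """Equation (4): L1 distance between VGG features of I_out / I_comp and
    those of I_gt, normalised by C_i * H_i * W_i for each selected layer."""
    loss = 0.0
    for f_out, f_comp, f_gt in zip(extractor(out), extractor(comp), extractor(gt)):
        norm = f_gt[0].numel()                      # C_i * H_i * W_i
        loss = loss + torch.abs(f_out - f_gt).sum(dim=(1, 2, 3)).mean() / norm
        loss = loss + torch.abs(f_comp - f_gt).sum(dim=(1, 2, 3)).mean() / norm
    return loss
```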
In addition to the perceptual loss, we also incorporate a style loss to maintain style consistency. The style loss compares the L1 distance between images in feature space after computing the corresponding Gram matrices [36] for each selected feature map. We assume that the high-level features in the $i$th selected layer have shape $(H_i W_i) \times C_i$, and we use $\Psi_i(p)^{\mathsf{T}} \Psi_i(p)$ to denote the auto-correlation (Gram matrix) constructed from the feature map of a given image $p$ in the $i$th selected layer. Similar to the perceptual loss, the style loss is calculated using the L1 distance of the Gram matrices between $I_{out}$ and $I_{gt}$, as well as between $I_{comp}$ and $I_{gt}$, again using layers pool1, pool2, and pool3 of the pre-trained VGG-16. The formulae are defined as follows:
$L_{style_{out}} = \sum_{i=0}^{N-1} \frac{1}{C_i C_i} \left\| \frac{1}{H_i W_i C_i} \left( \Psi_i(I_{out})^{\mathsf{T}} \Psi_i(I_{out}) - \Psi_i(I_{gt})^{\mathsf{T}} \Psi_i(I_{gt}) \right) \right\|_1$ (5)
$L_{style_{comp}} = \sum_{i=0}^{N-1} \frac{1}{C_i C_i} \left\| \frac{1}{H_i W_i C_i} \left( \Psi_i(I_{comp})^{\mathsf{T}} \Psi_i(I_{comp}) - \Psi_i(I_{gt})^{\mathsf{T}} \Psi_i(I_{gt}) \right) \right\|_1$ (6)
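A compact sketch of Equations (5) and (6) is shown below; it reuses the lists of VGG feature maps produced by the extractor sketched above (an assumed name) and computes the normalized Gram matrices per layer.

```python
import torch

def gram(feat):
    """Auto-correlation (Gram) matrix of a (N, C, H, W) feature map,
    normalised by H*W*C as in Equations (5)-(6)."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_a, feats_b):
    """L1 distance between Gram matrices of two lists of VGG feature maps
    (e.g. features of I_out and I_gt), with the extra 1/(C_i*C_i) factor."""
    loss = 0.0
    for fa, fb in zip(feats_a, feats_b):
        c = fa.shape[1]
        loss = loss + torch.abs(gram(fa) - gram(fb)).sum(dim=(1, 2)).mean() / (c * c)
    return loss

# Usage sketch: style_loss(vgg(I_out), vgg(I_gt)) + style_loss(vgg(I_comp), vgg(I_gt))
```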
Total variation (TV) loss can ameliorate the checkerboard artifacts generated by the perceptual loss [37]. In Equation (7), $R$ is the region of 1-pixel dilation of the hole region [38], $N_{I_{comp}}$ is the number of elements in $I_{comp}$, and $i$ and $j$ represent the coordinates of the pixels in the feature map.
$L_{tv} = \sum_{(i,j) \in R,\ (i,j+1) \in R} \frac{\left\| I_{comp}^{i,j+1} - I_{comp}^{i,j} \right\|_1}{N_{I_{comp}}} + \sum_{(i,j) \in R,\ (i+1,j) \in R} \frac{\left\| I_{comp}^{i+1,j} - I_{comp}^{i,j} \right\|_1}{N_{I_{comp}}}$ (7)
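The following sketch approximates Equation (7); the 1-pixel dilation of the hole region is obtained here by max-pooling the inverted mask, which is one simple way to construct the region R and not necessarily the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def tv_loss(comp, mask):
    """Approximate total-variation loss of Equation (7), restricted to a
    1-pixel dilation of the hole region. comp is (N, C, H, W); mask is
    (N, 1, H, W) with 1 = non-hole, 0 = hole."""
    hole = 1 - mask                                                   # 1 inside the hole
    region = F.max_pool2d(hole, kernel_size=3, stride=1, padding=1)   # dilate by 1 px
    n_elem = comp[0].numel()                                          # N_{I_comp} per sample
    dh = torch.abs(comp[:, :, :, 1:] - comp[:, :, :, :-1]) * region[:, :, :, 1:]
    dv = torch.abs(comp[:, :, 1:, :] - comp[:, :, :-1, :]) * region[:, :, 1:, :]
    return (dh.sum(dim=(1, 2, 3)) + dv.sum(dim=(1, 2, 3))).mean() / n_elem
```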
The generator loss $L_G$, defined as the combination of all the loss terms above, is given by the following formula:
$L_G = \lambda_{valid} L_{valid} + \lambda_{hole} L_{hole} + \lambda_{perceptual} L_{perceptual} + \lambda_{style} \left( L_{style_{out}} + L_{style_{comp}} \right) + \lambda_{tv} L_{tv}$ (8)
For the tradeoff parameters, we empirically set $\lambda_{valid} = 1$, $\lambda_{hole} = 6$, $\lambda_{perceptual} = 0.1$, $\lambda_{style} = 80$, and $\lambda_{tv} = 0.5$. The loss term weights were determined by hyperparameter tuning on ImageNet.
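Putting the pieces together, a sketch of the combined generator loss of Equation (8) with the weights listed above could read as follows; the function names refer to the earlier sketches and are assumptions rather than the authors' code.

```python
# Assumed names: reconstruction_losses, perceptual_loss, style_loss, tv_loss
# from the sketches above, and `vgg`, an instance of the VGGFeatures extractor.
lambda_valid, lambda_hole = 1.0, 6.0
lambda_perceptual, lambda_style, lambda_tv = 0.1, 80.0, 0.5

def generator_loss(vgg, mask, out, gt):
    comp = mask * gt + (1 - mask) * out                     # Equation (3)
    l_valid, l_hole = reconstruction_losses(mask, out, gt)  # Equations (1)-(2)
    l_perc = perceptual_loss(vgg, out, comp, gt)            # Equation (4)
    l_style = style_loss(vgg(out), vgg(gt)) + style_loss(vgg(comp), vgg(gt))
    l_tv = tv_loss(comp, mask)                              # Equation (7)
    return (lambda_valid * l_valid + lambda_hole * l_hole +
            lambda_perceptual * l_perc + lambda_style * l_style +
            lambda_tv * l_tv)                               # Equation (8)
```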

3.5. The Adversarial Loss

In this study, the adversarial loss of the discriminator, $L_D$, is defined by Equation (9) [23], where $D(\cdot)$ is the discriminator, $G(\cdot)$ is the generator, and $M$ represents the mask image (0 for the hole region, 1 for the non-hole region). $I_{gt}$ represents the ground truth image, $I_{in} = I_{gt} \odot M$ represents the input image, and $I_{out}$ is the image reconstructed by the generator; refer to Figure 4. $\mathbb{E}$ represents the expected value. We clip the discriminator weights $W_D$ to the range $[-0.01, 0.01]$ to improve the stability of our model.
$L_D = \mathbb{E}_{I_{out}} \left[ \left\| M - D(I_{out}) \right\|_1 \right], \quad W_D \in [-0.01, 0.01]$ (9)
The total loss $L_{total}$ of our model is given by Equation (10). The generator loss measures the overall difference between $I_{out}$ and $I_{gt}$, while the adversarial loss gives the generator more gradient information about the hole area.
$\lambda_{adv}$ is a hyperparameter that controls the contribution of the discriminator's adversarial loss. Through $\lambda_{adv}$, the strength of the discriminator is adjusted to balance the generator and the discriminator. $\lambda_{adv}$ ranges from 0 to 1; the larger its value, the stronger the discriminator becomes. If the generator is too strong and the discriminator is too weak, the network may effectively ignore the discriminator and train the generator on its own, which can leave the generator unable to reconstruct the image effectively. In that case, the predicted mask output by the discriminator may differ greatly from the original mask; for instance, the predicted mask may contain a large dark gray area, or the entire image may be dark gray. On the other hand, if the generator is too weak and the discriminator is too strong, the predicted mask may be sharper and closely resemble the original mask, yet it cannot provide enough gradient information to the generator. Ideally, the discriminator should be balanced, neither too strong nor too weak, so that the predicted mask gives a rough, rather than exact, estimate of the hole area's extent.
$L_{total} = L_G + \lambda_{adv} L_D, \quad \lambda_{adv} = 0.15$ (10)
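The sketch below illustrates the adversarial term and the weight clipping in this spirit; the exact L1 form of Equation (9) is reconstructed from Sections 3.5 and 3.6 and should be read as an assumption, and `disc` is a hypothetical discriminator module that outputs a per-pixel mask prediction.

```python
import torch

lambda_adv = 0.15   # Equation (10)

def adversarial_loss(disc, out, mask):
    """Per-pixel adversarial term in the spirit of Equation (9): the
    discriminator predicts, for every pixel, the probability of belonging
    to the hole region, and the L1 distance to the true mask is penalised."""
    return torch.abs(mask - disc(out)).mean()

def clip_discriminator_weights(disc, limit=0.01):
    """WGAN-style weight clipping to [-0.01, 0.01], applied after each
    discriminator update for training stability."""
    with torch.no_grad():
        for p in disc.parameters():
            p.clamp_(-limit, limit)

# Total loss of Equation (10): L_total = L_G + lambda_adv * L_D.
```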

3.6. The Main Framework

As shown in Figure 4, the white region in the input image represents the hole area, whereas the mask image is a binary image in which the black region represents the hole area. In the discriminator, the output image undergoes an L1 loss operation with the mask image, and the discriminator generates the predicted mask image. This predicted mask is a grayscale image with values ranging from 0 to 1, where the depth of each pixel's color signifies the likelihood that the pixel belongs to the hole area. If a pixel in the predicted mask is darker, its value is closer to 1, indicating a higher probability of belonging to the hole area; otherwise, it belongs to the non-hole area. With this information, the predicted mask image gives the generator more gradient information about the hole areas.

4. Experiments

4.1. Experimental Settings

We conducted training and testing on ImageNet 2012 and CelebA-HQ. The ImageNet 2012 dataset comprises roughly 1.2 million training images, 50,000 validation images, and 100,000 test images with no published labels. The CelebA-HQ dataset contains 30,000 high-resolution face images selected from the CelebA dataset, from which we chose 29,000 images for the training set and 1000 for the test set. We utilized the original “train”, “test”, and “val” splits of ImageNet and CelebA-HQ for training, testing, and validation. The model was implemented in PyTorch (2.1.2) and run on an Intel i9-10900K CPU and an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with a batch size of 16. We trained the model for 400,000 iterations with a learning rate of 1 × 10−4, using the Adam optimizer for network convergence. All masks and images are resized to 256 × 256. All mask images were generated from a random combination of rectangles, circles, ovals, and sectors.
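For illustration, a mask generator in this spirit could be written as follows; the shape counts, size ranges, and use of PIL are assumptions, since the paper does not specify them.

```python
import random
import numpy as np
from PIL import Image, ImageDraw

def random_mask(size=256):
    """Sketch of the mask generation described above: a random combination of
    rectangles, circles, ovals and sectors. 1 = non-hole, 0 = hole, matching
    the convention of Section 3.4. Counts and size ranges are illustrative."""
    img = Image.new("L", (size, size), 255)          # start with no holes
    draw = ImageDraw.Draw(img)
    for _ in range(random.randint(1, 4)):
        x0, y0 = random.randint(0, size - 64), random.randint(0, size - 64)
        x1, y1 = x0 + random.randint(32, 96), y0 + random.randint(32, 96)
        shape = random.choice(["rectangle", "circle", "oval", "sector"])
        if shape == "rectangle":
            draw.rectangle([x0, y0, x1, y1], fill=0)
        elif shape == "circle":
            r = (x1 - x0) // 2
            draw.ellipse([x0, y0, x0 + 2 * r, y0 + 2 * r], fill=0)
        elif shape == "oval":
            draw.ellipse([x0, y0, x1, y1], fill=0)
        else:  # sector
            draw.pieslice([x0, y0, x1, y1], start=random.randint(0, 180),
                          end=random.randint(181, 360), fill=0)
    return (np.array(img) > 0).astype(np.float32)    # (H, W) array of 0/1

mask = random_mask()
print(mask.shape, mask.min(), mask.max())
```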
Figure 5 and Figure 6 display some results of inpainting images by our model. From left to right, the images show the ground truth image, input image, reconstructed image, mask image, and predicted mask image.

4.2. Quantitative Evaluation

In this study, we compare our method with Contextual Attention (CA) [15] and Globally and Locally Consistent Image Completion (GLCIC) [20] using their respective pre-trained weights for a quantitative analysis.
CA is a method proposed by Jiahui Yu. This approach uses the contextual attention mechanism to establish long-range dependencies of the feature map for inpainting.
GLCIC is a method proposed by Iizuka et al. This approach utilizes a single completion network for image completion and global and local context discriminator networks to realistically complete images.
To evaluate our model’s performance, we selected the PSNR (peak signal-to-noise ratio) [39], SSIM (structural similarity index) [40], and FID (Fréchet Inception Distance) [41] as indicators.
PSNR is frequently used in image quality assessment for image inpainting tasks. A higher PSNR value indicates better image quality. SSIM assesses the reconstructed image by comparing its brightness, contrast, and structural attributes to those of the original images. SSIM values are positively correlated with image inpainting effectiveness. FID (Fréchet Inception Distance) is a commonly used numerical measure in image generation to assess the similarity between the reconstructed image and the ground truth (GT) images. Lower FID values indicate a closer match to the distribution of GT images, typically suggesting higher-quality generated images.
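As a small illustration, PSNR can be computed directly from the mean squared error, as sketched below; SSIM and FID are usually taken from standard library implementations rather than re-derived.

```python
import torch

def psnr(out, gt, max_val=1.0):
    """Peak signal-to-noise ratio for images scaled to [0, max_val];
    higher values indicate a reconstruction closer to the ground truth."""
    mse = torch.mean((out - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

gt = torch.rand(1, 3, 256, 256)
out = gt + 0.01 * torch.randn_like(gt)
print(float(psnr(out.clamp(0, 1), gt)))   # roughly 40 dB for 1% noise
```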
For masks with different hole area ratios, the performance of the compared models is presented in Table 1, Table 2 and Table 3. These tables display the quantitative evaluation results for the ImageNet and CelebA-HQ datasets. Our method outperforms CA and GLCIC in terms of PSNR, SSIM, and FID on both datasets.

4.3. Qualitative Evaluation

In Figure 5 and Figure 6, some parts of the predicted mask image are partially hidden or unclear, with the gray value indicating the likelihood of these pixels belonging to the hole area. A deeper color suggests that a pixel is close to 1 and more likely to belong to the hole area. The inpainted images generated by our model closely resemble the ground truth. Figure 7 and Figure 8 show that the GLCIC and CA methods failed to eliminate checkerboard artifacts generated by perceptual loss [37]. However, our model alleviated the checkerboard artifacts, as illustrated in Figure 5, Figure 6, Figure 7 and Figure 8. In the predicted mask images shown in Figure 7 and Figure 8, the main features outlined in the ground truth image can be faintly seen, such as a bird, worm, face, or other objects. One reason is that the MHSA module helps the model search the entire image for relevant content to fill the hole regions. Another reason is that the full skip connections transfer features extracted between the encoder and decoder.

4.4. Ablation Study

In order to verify the effectiveness of the DCNN, MHSA, SC-Unet, generator, and discriminator modules, the following five models were designed:
Model-1: This model uses the original U-Net with the generator of WGAN for training. The modules of this model include U-Net and generator modules.
Model-2: Compared to Model-1, this model involves the discriminator of WGAN for training. The modules of this model include U-Net, generator, and discriminator modules.
Model-3: Compared to Model-2, this model uses SC-Unet instead of U-Net for training. The other modules remain the same. The convolutional block of SC-Unet still uses the original convolutional block of U-Net. Refer to Figure 2 left. The modules of this model include SC-Unet, generator, and discriminator modules.
Model-4: Compared to Model-3, this model adds DCNN in the convolutional block of SC-Unet for training. Please refer to Figure 2 right. The modules remain the same. This model includes SC-Unet, DCNN, generator, and discriminator modules.
Model-5: Compared to Model-4, this model adds MHSA to the convolutional block of SC-Unet for training. Please refer to Figure 3. The other modules remain the same. This model includes the SC-Unet, DCNN, MHSA, generator, and discriminator.
In the ablation study, we tested different model modules, as indicated in Figure 9 and Figure 10, Appendix A, and Table 4, Table 5 and Table 6. Compared to Model-1, Model-2 showed improved inpainting results, demonstrating that including the discriminator's adversarial loss can significantly enhance the generator's performance in image inpainting. Model-3 used SC-Unet instead of U-Net for training, while its convolutional block still used the original U-Net block; the result demonstrates that SC-Unet itself is effective for image inpainting. The full skip connections of SC-Unet help propagate gradients across convolutional layers during back-propagation, which in turn allows the network to be deeper and prevents gradient vanishing. The parameter count of Model-3 is 62,148,628; the GPU video memory usage is 12.8 GB, and one epoch takes about 691 min. Model-4 incorporated the DCNN in the convolutional block, which enlarged the receptive fields to enhance network representation. The parameter count of Model-4 is 194,257,220; the GPU video memory usage is 16.3 GB, and one epoch takes about 870 min. Model-5 added MHSA to the convolutional block. The parameter count of Model-5 is 213,877,604; the GPU video memory usage is 16.7 GB, and one epoch takes about 940 min. Although Model-5 increases the number of trainable parameters by only about 10% over Model-4, with the help of MHSA it can search the entire image for appropriate content to fill the missing regions. Model-4 and Model-5 further improved the inpainting results. Model-5 is our final model; its output exhibits the smoothest edges and the richest semantic information, effectively synthesizing the hole and non-hole areas.

5. Discussions

We have developed a method called Symmetric Connected U-Net (SC-Unet) and integrated it into the generator and discriminator of WGAN. Our model also includes dilated convolutional neural networks (DCNNs) and MHSA. However, we have noticed that the current model has limitations in effectively repairing large or highly irregular hole areas. As the size of the hole area increases, SC-Unet requires a larger receptive field and a deeper network for image inpainting, which means that more computation and training resources are needed. From the ablation study, we can see that after adding the DCNN, the number of parameters increased from 62,148,628 in Model-3 to 194,257,220 in Model-4, almost tripling in size. Adding a DCNN therefore significantly increases the computation of the entire network. Consequently, it is difficult for this model to complete image inpainting of such large-scale hole areas under existing conditions.
Our next research step involves inpainting images with large hole areas. One potential solution could be to improve the network structure, for example, by implementing a multi-stage inpainting approach where coarse structures are filled first, followed by fine details. Additionally, exploring new techniques to expand the receptive field may be beneficial. Since there are similarities between image inpainting and image generation, it might also be valuable to incorporate techniques used in image generation models, such as diffusion-based models.

Author Contributions

Conceptualization, Y.H. and X.M.; methodology, Y.H. and X.M.; software, Y.H. and X.M.; validation, Y.H. and X.M.; analysis, Y.H. and X.M.; resources, Y.H. and X.M.; data curation, J.Z. and C.G.; visualization, J.Z. and C.G.; supervision, Y.H. and X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Henan Provincial Science and Technology Research Project (grant numbers: 222102220107 and 242102210101).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are from public datasets that can be downloaded from the public data providers https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 1 September 2024) and https://www.image-net.org/ (accessed on 1 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Appendix A

Figure A1. Ablation study with different model modules on ImageNet.
Figure A2. Ablation study with different model modules on ImageNet.
Figure A3. Ablation study with different model modules on CelebA-HQ.
Figure A4. Ablation study with different model modules on CelebA-HQ.

References

  1. Sun, J.; Yuan, L.; Jia, J. Image completion with structure propagation. In ACM SIGGRAPH 2005 Papers; Association for Computing Machinery: New York, NY, USA, 2005; pp. 861–868. [Google Scholar]
  2. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 7–12. [Google Scholar]
  3. Xu, Y.; Gu, T.; Chen, W. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv 2024, arXiv:2403.01779. [Google Scholar]
  4. Chunqi, F.; Kun, R.; Lisha, M. Advances in digital image inpainting algorithms based on deep learning. J. Signal Process 2020, 36, 102–109. [Google Scholar]
  5. Qin, Z.; Zeng, Q.; Zong, Y. Image inpainting based on deep learning: A review. Displays 2021, 69, 102028. [Google Scholar] [CrossRef]
  6. Mardieva, S.; Ahmad, S.; Umirzakova, S.; Rasool, M.A.; Whangbo, T.K. Lightweight image super-resolution for IoT devices using deep residual feature distillation network. Knowl.-Based Syst. 2024, 285, 111343. [Google Scholar] [CrossRef]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  8. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 1–26 July 2016; pp. 770–778. [Google Scholar]
  10. Jiao, L.; Wu, H.; Wang, H.; Bie, R. Multi-scale semantic image inpainting with residual learning and GAN. Neurocomputing 2019, 331, 199–212. [Google Scholar] [CrossRef]
  11. Araujo, A.; Norris, W.; Sim, J. Computing receptive fields of convolutional neural networks. Distill 2019, 4, e21. [Google Scholar] [CrossRef]
  12. Phutke, S.S.; Murala, S. Diverse receptive field based adversarial concurrent encoder network for image inpainting. IEEE Signal Process. Lett. 2021, 28, 1873–1877. [Google Scholar] [CrossRef]
  13. Yu, F. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  14. Yan, Z.; Li, X.; Li, M. Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 1–17. [Google Scholar]
  15. Yu, J.; Lin, Z. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
  16. Liu, W.; Shi, Y.; Li, J. Multi-stage Progressive Reasoning for Dunhuang Murals Inpainting. In Proceedings of the 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, 4–6 August 2023; pp. 211–217. [Google Scholar]
  17. Liu, G.; Reda, F.A.; Shih, K.J. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  18. Pathak, D.; Krahenbuhl, P.; Donahue, J. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar]
  19. Im, D.J.; Kim, C.D.; Jiang, H. Generating Images with Recurrent Adversarial Networks. arXiv 2016, arXiv:1602.05110. [Google Scholar]
  20. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph 2017, 36, 1–14. [Google Scholar] [CrossRef]
  21. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  22. Arjovsky, M.; Chintala, S. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  23. Mirza, M. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  24. Zhu, J.-Y.; Park, T.; Isola, P. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  25. Lou, S.; Fan, Q.; Chen, F. Preliminary investigation on single remote sensing image inpainting through a modified GAN. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing, PRRS, Beijing, China, 19–20 August 2018; pp. 1–6. [Google Scholar]
  26. Deng, Y.; Hui, S.; Zhou, S. Learning contextual transformer network for image inpainting. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2529–2538. [Google Scholar]
  27. Wang, N.; Ma, S.; Li, J. Multistage attention network for image inpainting. Pattern Recognit. 2020, 106, 107448. [Google Scholar] [CrossRef]
  28. Li, J.; Wang, N.; Zhang, L. Recurrent feature reasoning for image inpainting. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7757–7765. [Google Scholar]
  29. Liu, H.; Jiang, B. Coherent semantic attention for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4170–4179. [Google Scholar]
  30. Wang, W.; Xie, E.; Li, X. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  32. Wang, Q.; He, S.; Su, M. Context-Encoder-Based Image Inpainting for Ancient Chinese Silk. Appl. Sci. 2024, 14, 6607. [Google Scholar] [CrossRef]
  33. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  34. Russakovsky, O.; Deng, J.; Su, H. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  35. Gatys, L.A.; Ecker, A.S.; Bethge, M. A neural algorithm of artistic style. arXiv 2015, arXiv:1508.06576. [Google Scholar] [CrossRef]
  36. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 27–30. [Google Scholar]
  37. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711. [Google Scholar]
  38. Karras, T.; Aila, T.; Laine, S. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  39. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR), IEEE, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  41. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv 2017, arXiv:1706.08500. [Google Scholar]
Figure 1. SC-Unet architecture, with a 3 × 256 × 256 image as input.
Figure 2. (left) The original convolutional block of U-Net; (right) the Dilated Convolutional Neural Network in the convolutional block of SC-Unet.
Figure 3. Multi-Head Self-Attention in the convolutional block of SC-Unet.
Figure 4. The main framework of our image inpainting model.
Figure 5. Sample of results with CelebA-HQ dataset. (a) Ground truth image. (b) Input image. (c) Reconstructed image. (d) Mask image. (e) Predicted image.
Figure 6. Sample of results with ImageNet dataset. (a) Ground truth image. (b) Input image. (c) Reconstructed image. (d) Mask image. (e) Predicted image.
Figure 7. Comparison with GLCIC and CA models on ImageNet.
Figure 8. Comparison with GLCIC and CA models on CelebA-HQ.
Figure 9. Ablation study with different model modules on ImageNet.
Figure 10. Ablation study with different model modules on CelebA-HQ.
Table 1. PSNR in comparison with CA and GLCIC on ImageNet and CelebA-HQ; higher is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | CA | 32.45 | 28.13 | 24.97 | 22.74 | 20.18
ImageNet | GLCIC | 30.09 | 26.75 | 23.88 | 21.61 | 19.55
ImageNet | Ours | 35.31 | 31.12 | 28.09 | 24.56 | 22.39
CelebA-HQ | CA | 32.36 | 27.89 | 25.86 | 23.73 | 20.12
CelebA-HQ | GLCIC | 30.13 | 25.91 | 22.02 | 19.60 | 17.37
CelebA-HQ | Ours | 36.19 | 33.86 | 30.55 | 26.67 | 23.39
Table 2. SSIM in comparison with CA and GLCIC on ImageNet and CelebA-HQ; higher is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | CA | 0.937 | 0.875 | 0.801 | 0.720 | 0.650
ImageNet | GLCIC | 0.910 | 0.855 | 0.754 | 0.660 | 0.609
ImageNet | Ours | 0.960 | 0.910 | 0.870 | 0.818 | 0.784
CelebA-HQ | CA | 0.959 | 0.909 | 0.832 | 0.750 | 0.721
CelebA-HQ | GLCIC | 0.936 | 0.860 | 0.799 | 0.711 | 0.689
CelebA-HQ | Ours | 0.975 | 0.932 | 0.899 | 0.851 | 0.794
Table 3. FID in comparison with CA and GLCIC on ImageNet and CelebA-HQ; lower is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | CA | 12.84 | 23.13 | 35.69 | 44.72 | 56.19
ImageNet | GLCIC | 15.09 | 32.75 | 46.88 | 55.61 | 69.55
ImageNet | Ours | 8.15 | 15.26 | 26.12 | 33.26 | 45.88
CelebA-HQ | CA | 3.24 | 6.87 | 12.83 | 20.01 | 38.14
CelebA-HQ | GLCIC | 4.24 | 10.91 | 18.76 | 29.33 | 47.37
CelebA-HQ | Ours | 1.69 | 2.36 | 4.58 | 6.12 | 8.19
Table 4. PSNR comparison with different model modules on ImageNet and CelebA-HQ; higher is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | Model-1 | 23.89 | 22.35 | 19.26 | 15.39 | 14.17
ImageNet | Model-2 | 26.61 | 24.49 | 21.13 | 18.44 | 15.39
ImageNet | Model-3 | 30.06 | 27.55 | 23.67 | 21.56 | 19.38
ImageNet | Model-4 | 33.15 | 28.69 | 26.44 | 22.99 | 20.18
ImageNet | Model-5 | 35.31 | 31.12 | 28.09 | 24.56 | 22.39
CelebA-HQ | Model-1 | 24.86 | 23.25 | 19.76 | 17.19 | 15.13
CelebA-HQ | Model-2 | 27.96 | 25.06 | 21.73 | 19.04 | 16.19
CelebA-HQ | Model-3 | 30.18 | 29.60 | 25.85 | 22.76 | 19.33
CelebA-HQ | Model-4 | 32.64 | 31.53 | 28.04 | 24.23 | 22.86
CelebA-HQ | Model-5 | 36.19 | 33.86 | 30.55 | 26.67 | 23.39
Table 5. SSIM comparison with different model modules on ImageNet and CelebA-HQ; higher is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | Model-1 | 0.809 | 0.742 | 0.633 | 0.609 | 0.585
ImageNet | Model-2 | 0.850 | 0.793 | 0.704 | 0.654 | 0.629
ImageNet | Model-3 | 0.899 | 0.842 | 0.755 | 0.702 | 0.684
ImageNet | Model-4 | 0.933 | 0.899 | 0.811 | 0.758 | 0.735
ImageNet | Model-5 | 0.960 | 0.910 | 0.870 | 0.818 | 0.784
CelebA-HQ | Model-1 | 0.813 | 0.748 | 0.694 | 0.621 | 0.592
CelebA-HQ | Model-2 | 0.862 | 0.801 | 0.755 | 0.684 | 0.643
CelebA-HQ | Model-3 | 0.895 | 0.842 | 0.796 | 0.743 | 0.691
CelebA-HQ | Model-4 | 0.923 | 0.901 | 0.851 | 0.796 | 0.746
CelebA-HQ | Model-5 | 0.975 | 0.932 | 0.899 | 0.851 | 0.794
Table 6. FID comparison with different model modules on ImageNet and CelebA-HQ; lower is better.

Dataset | Method | 1–10% | 10–20% | 20–30% | 30–40% | 40–50%
ImageNet | Model-1 | 54.27 | 68.29 | 72.35 | 86.05 | 112.87
ImageNet | Model-2 | 39.27 | 52.33 | 66.19 | 74.77 | 93.15
ImageNet | Model-3 | 28.47 | 44.59 | 57.22 | 69.23 | 82.00
ImageNet | Model-4 | 15.14 | 31.66 | 41.69 | 58.19 | 73.61
ImageNet | Model-5 | 12.15 | 23.26 | 35.12 | 51.26 | 66.88
CelebA-HQ | Model-1 | 21.86 | 38.11 | 51.28 | 72.39 | 93.16
CelebA-HQ | Model-2 | 13.25 | 26.76 | 39.46 | 55.84 | 64.27
CelebA-HQ | Model-3 | 7.60 | 11.29 | 20.85 | 32.76 | 37.37
CelebA-HQ | Model-4 | 3.96 | 6.13 | 9.04 | 13.23 | 19.86
CelebA-HQ | Model-5 | 1.69 | 2.36 | 4.58 | 6.12 | 8.19
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Cite as: Hou, Y.; Ma, X.; Zhang, J.; Guo, C. Symmetric Connected U-Net with Multi-Head Self Attention (MHSA) and WGAN for Image Inpainting. Symmetry 2024, 16, 1423. https://doi.org/10.3390/sym16111423