Article

HE-CycleGAN: A Symmetric Network Based on High-Frequency Features and Edge Constraints Used to Convert Facial Sketches to Images

School of Computer Science, Northeast Electric Power University, Jilin 132012, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(8), 1015; https://doi.org/10.3390/sym16081015
Submission received: 12 July 2024 / Revised: 2 August 2024 / Accepted: 6 August 2024 / Published: 8 August 2024
(This article belongs to the Special Issue Symmetry with Optimization in Real-World Applications)

Abstract
The task of converting facial sketch images to facial images aims to generate reasonable and clear facial images from a given facial sketch. However, the facial images generated by existing methods are often blurry and suffer from edge overflow. In this study, we propose HE-CycleGAN, a novel facial-image generation network with a symmetric architecture consisting of two identical generators, two identical patch discriminators, and two identical edge discriminators. We added a newly designed high-frequency feature extractor (HFFE) to the generator of HE-CycleGAN. The HFFE extracts high-frequency detail features from the feature maps output by the three convolutional modules at the front end of the generator and feeds them to the end of the generator to enrich the details of the generated face. To address the issue of facial edge overflow, we designed a multi-scale wavelet edge discriminator (MSWED) to judge the rationality of facial edges and better constrain them. We trained and tested the proposed HE-CycleGAN on the CUHK, XM2VTS, and AR datasets. The experimental results indicate that HE-CycleGAN generates higher-quality facial images than several state-of-the-art methods.

1. Introduction

The conversion between facial sketch images and facial images aims to establish a mapping between the two domains. Converting facial sketch images into facial images can be applied in the field of criminal investigation, assisting in the more accurate recognition or retrieval of a suspect's face. The reverse conversion, from facial images to facial sketch images, can produce manga-style facial images and be applied in the field of digital entertainment. This study focuses on converting facial sketches into facial images.
By reviewing existing methods for converting facial sketch images to facial images, we found that most of these methods suffer from edge overflow in the generated facial images. At the position of the purple box in Figure 1, the facial images generated using existing methods [1,2,3,4] have varying degrees of overflow at the edges of the face. This indicates that existing methods have insufficient constraints on facial edges.
At present, the representative methods in the field of image transformation are those based on generative adversarial networks (GANs) [5] and those based on CycleGAN [1]. Figure 1 illustrates a representative GAN-based method, Pix2Pix [5], and three CycleGAN-based methods, CSGAN [2], CDGAN [3], and LSCIT [4]. However, GAN-based methods require a large amount of fully aligned training data, which is difficult to obtain [6]. In the task of converting facial sketch images to facial images, although there are paired facial sketch images and facial images, they are not completely aligned. As shown in Figure 2, the face length in the facial sketch images in the first column is significantly longer than in the corresponding facial images (Ground Truth, GT) in the second column. This may make it difficult for Pix2Pix (a GAN-based method) to generate high-quality facial images; as shown in Figure 2, the facial images generated by Pix2Pix are generally very blurry. CycleGAN, by contrast, does not require fully aligned facial sketch image and facial image data [7], and as shown in Figure 2, the quality of the facial images generated by CycleGAN is significantly better than that of Pix2Pix. However, we found that the facial images generated by CycleGAN still suffer from blurring, and some details in the facial sketch images are lost. As shown in Figure 3, we input the facial sketch image (Figure 3a) into CycleGAN to generate a facial image (Figure 3b) and found that some detailed information was lost at the hair gap marked by the red box in Figure 3b. However, when we input the generated facial image (Figure 3b) into CycleGAN and reconstructed a facial sketch image (Figure 3c), the reconstruction restored the detailed information at the hair gap. This phenomenon indicates that, in order to satisfy the cycle-consistency constraint, the generator tends to hide some information in the input image [8,9,10,11], which makes it difficult for CycleGAN to generate rich details.
To ensure that the generated facial images have rich details and reasonable edges, this paper proposes HE-CycleGAN: a CycleGAN based on high-frequency detail features and edge constraints. HE-CycleGAN is a deep neural network with a symmetric structure consisting of two generators and four discriminators. The main innovation of this paper lies in extracting high-frequency information from the intermediate feature maps of the network and propagating it to later layers for reuse. Based on this idea, we designed two generators and two additional discriminators for HE-CycleGAN. In the generator of HE-CycleGAN, a high-frequency feature extractor (HFFE) extracts high-frequency detail information from the feature maps at the encoding end and sends it to the decoding end to enrich the details of the generated face. To address the issue of facial edge overflow, we designed a multi-scale wavelet edge discriminator (MSWED). The MSWED uses wavelet transforms to extract facial edge information at each scale and then judges the rationality of the facial edges based on this information, thereby constraining the edges and alleviating the edge overflow problem. Our contributions are the following four aspects:
(1)
We propose a network called HE-CycleGAN for converting facial sketch images to facial images.
(2)
We added a high-frequency feature extractor (HFFE) to the generator of HE-CycleGAN, which alleviates the loss of detail that occurs when the traditional CycleGAN hides information to satisfy the cycle-consistency constraint.
(3)
We designed a multi-scale wavelet edge discriminator (MSWED), which addresses the problem of edge overflow in the generated facial images.
(4)
Finally, we quantitatively and qualitatively validated the effectiveness of the proposed HE-CycleGAN.
The rest of this paper is organized as follows: Section 2 reviews the relevant work on the conversion of facial sketch images to facial images. Section 3 provides a detailed introduction to the proposed HE-CycleGAN. The effectiveness of HE-CycleGAN is demonstrated through experiments detailed in Section 4. Finally, Section 5 provides a summary of this paper.

2. Related Work

Early methods were based on traditional machine learning. Wang and Tang [12] proposed a multi-scale Markov Random Field (MRF) model that divides the facial region into overlapping small patches for learning in order to generate facial images. Xiao et al. [13] proposed an embedded hidden Markov model (EHMM) that can learn the non-linear mapping between facial sketch image and facial image pairs with fewer training samples and generate pseudo-images in the form of sketch images. These traditional methods do not require large amounts of training data, but the facial images they generate are very blurry and often lack much of the texture information.
Deep learning methods have shown better results than traditional methods. They mainly include convolutional neural network (CNN)-based methods and GAN-based methods. CNNs are a special type of neural network that use convolutional computation for feature extraction, enabling the classification and prediction of input data [14]. Zhang et al. [15] proposed an end-to-end fully convolutional network to directly model the non-linear mapping between facial images and sketch images, providing effective pixel-level prediction; however, the facial images generated by this approach lack clear edges and contours. A GAN consists of a generator and a discriminator: the generator is responsible for generating realistic images, while the discriminator is responsible for distinguishing generated images from real images [16,17]. Pix2Pix, proposed by Isola et al. [5], provides a universal framework for image-to-image conversion problems using conditional GANs. Li et al. [18] proposed a face image generation method with a conditional self-attention GAN, in which capturing image structural information through self-attention alleviates the difficulty of generating facial structures. Chen et al. [19] proposed an implicit spatial modeling GAN to alleviate the problem of overfitting to facial sketch images. Li et al. [20] proposed a two-stage GAN for face generation that maintains semantic consistency through a semantic loss and generates fine-grained information through a color refinement loss to better preserve facial attributes. Sun et al. [21] proposed a new end-to-end generative adversarial fusion model that parameterizes the Tanh activation function to learn the distribution of facial illumination highlights, thereby alleviating the highlight problem. However, as discussed in the third paragraph of the Introduction, due to the lack of fully aligned training data, the facial images generated by GAN-based methods are relatively blurry and lack rich texture details.
Zhu et al. [1] proposed the cycle-consistent adversarial network (CycleGAN) as a new framework for image-to-image conversion tasks. Compared to GAN-based methods, CycleGAN achieves a significant improvement in the quality of the generated facial images. Babu et al. [2] proposed the cyclic-synthesized generative adversarial network (CSGAN), which uses a cyclic-synthesized loss to reduce the difference between the generated image and the cyclic image and thereby improve generation quality. Compared to CycleGAN, CSGAN achieves significant improvements in generating facial images, but issues with blurred details and facial edge overflow remain. Babu et al. [3] proposed the cyclic discriminative generative adversarial network (CDGAN), built on CSGAN; by introducing a new cyclic discriminative adversarial loss, the complexity of adversarial learning is increased, further improving image generation quality. Compared to CSGAN, CDGAN generates clearer facial images, but some issues with blurred details and facial edge overflow persist. Wang et al. [4] proposed an unsupervised long–short cycle-consistent adversarial network (LSCIT) to address the accumulation of errors during image conversion in cycle-consistent adversarial networks, using a long–short cycle-consistency constraint to eliminate error accumulation. LSCIT can generate clearer facial images, but artifacts appear in some of the generated faces, and overflow still occurs at the edges of the face. In order to satisfy the cycle-consistency constraint, the generator of CycleGAN tends to hide some information in the input image; therefore, the facial images generated by the CycleGAN-based methods mentioned above are more or less blurry and also suffer from facial edge overflow. The HE-CycleGAN proposed in this paper aims to solve the problems of image blurring and facial edge overflow in CycleGAN-based methods. The high-frequency feature extractor (HFFE) in the generator of HE-CycleGAN alleviates the loss of detail in facial images generated by the traditional CycleGAN, while the multi-scale wavelet edge discriminator (MSWED) addresses the problem of edge overflow in the generated facial images.

3. The Proposed Method

The architecture of the proposed HE-CycleGAN is shown in Figure 4. HE-CycleGAN has six parts: two generators, G_photo and G_sketch, and four discriminators, D_photo, D_sketch, D_photo-edge, and D_sketch-edge. G_photo is used to convert facial sketch images into facial images, and G_sketch is used to convert facial images into facial sketch images. D_photo is used to determine the authenticity of the generated facial image, while D_sketch is used to determine the authenticity of the generated facial sketch image. D_photo-edge is used to determine whether the edges of the generated facial image are reasonable, and D_sketch-edge is used to determine whether the edges of the generated facial sketch image are reasonable. The entire network has a symmetrical architecture: G_photo and G_sketch share the same structure, D_photo and D_sketch share the same structure, and D_photo-edge and D_sketch-edge share the same structure. D_photo and D_sketch are patch discriminators [22], while D_photo-edge and D_sketch-edge are multi-scale wavelet edge discriminators (MSWEDs).
HE-CycleGAN is divided into two branches. One branch inputs the facial sketch image X into the generator G_photo to generate the facial image Ŷ. The generated facial image Ŷ is evaluated by D_photo and D_photo-edge, and Ŷ is input into G_sketch to reconstruct the facial sketch image X̂. The other branch inputs the facial image Y into the generator G_sketch to generate a facial sketch image X̂. The generated facial sketch image X̂ is evaluated by D_sketch and D_sketch-edge, and X̂ is input into G_photo to reconstruct the facial image Ŷ. D_photo and D_photo-edge optimize G_photo through adversarial loss, and D_sketch and D_sketch-edge optimize G_sketch through adversarial loss.
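To make the data flow of the two branches concrete, the following minimal sketch (assuming a PyTorch-style implementation; the generator names simply mirror the notation above) traces one batch through both cycles:

```python
def cycle_forward(G_photo, G_sketch, x_sketch, y_photo):
    """Trace both HE-CycleGAN branches for one batch.

    x_sketch: batch of real facial sketches, y_photo: batch of real facial photos.
    Returns the generated and reconstructed images that feed the discriminators
    and the cycle-consistency loss.
    """
    # Branch 1: sketch -> photo -> reconstructed sketch
    y_hat = G_photo(x_sketch)   # judged by D_photo and D_photo-edge
    x_rec = G_sketch(y_hat)     # compared with x_sketch (cycle loss)

    # Branch 2: photo -> sketch -> reconstructed photo
    x_hat = G_sketch(y_photo)   # judged by D_sketch and D_sketch-edge
    y_rec = G_photo(x_hat)      # compared with y_photo (cycle loss)

    return y_hat, x_rec, x_hat, y_rec
```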

3.1. Generator Network Structure

The two generators G_photo and G_sketch of HE-CycleGAN have the same structure. As shown in Figure 5, the generator consists of an encoder, a decoder, nine residual blocks (RBs), and three high-frequency feature extractors (HFFEs). The encoder consists of three convolutional modules, each containing a convolutional layer, instance normalization, and a ReLU activation function. As shown in the dashed box in Figure 5, each residual block consists of two convolutional modules: the first includes a convolutional layer, instance normalization, and a ReLU activation function, while the second contains only a convolutional layer and instance normalization. The nine residual blocks extract high-level semantic features and transmit them to the decoder. The decoder consists of three convolutional modules; the first two consist of upsampling, a convolutional layer, instance normalization, and a ReLU activation function, using upsampling combined with convolution instead of transposed convolution to avoid checkerboard artifacts [23]. The last convolutional module consists of a convolutional layer and a Tanh activation function. This is because the facial sketch image input to generator G_photo is normalized to [−1, 1] and the output of G_photo must be fed into G_sketch to reconstruct the facial sketch image; the generator output must therefore also lie in [−1, 1], which the Tanh activation function guarantees. The decoder gradually restores the details and spatial dimensions of the high-level semantic features and decodes them into facial images. The HFFE modules extract high-frequency features from the three convolutional modules in the encoder and deliver them via skip connections to the corresponding three convolutional modules in the decoder, where the high-frequency features combine with the decoder's original features to fuse detail and semantics. The size of the feature map output by each module is shown in Figure 5.
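A minimal sketch of the residual block and the decoder up-block described above, assuming a PyTorch implementation (the channel width of 256 is illustrative and not taken from the paper):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block as described: conv + IN + ReLU followed by conv + IN."""
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

def up_block(in_ch, out_ch):
    """Decoder module: upsampling followed by convolution (instead of a
    transposed convolution) to avoid checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

The final decoder module would end in a Tanh activation so that outputs stay in [−1, 1], as described above.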
Because the traditional CycleGAN hides high-frequency details to satisfy the cycle-consistency constraint when converting facial sketch images to facial images, its generator struggles to produce rich detail information. In the HFFE, we use wavelet transforms to obtain high-frequency detail information and employ channel attention (ECANet [24]) to enhance it.
The HFFE module mainly consists of wavelet transforms and channel attention (ECANet [24]). In this paper, we adopt the classic Haar wavelet transform and use Algorithm 1 to extract high-frequency detail features. The Haar wavelet transform contains four kernels, LL^T, LH^T, HL^T, and HH^T, where the low (L) and high (H) filtering kernels are 1/√2 [1, 1] and 1/√2 [−1, 1], respectively. F_LL^T is the feature map obtained after low-frequency filtering in both the horizontal and vertical directions. F_LH^T is the feature map obtained after horizontal low-frequency filtering and vertical high-frequency filtering. F_HL^T is the feature map obtained after horizontal high-frequency filtering and vertical low-frequency filtering. F_HH^T is the feature map obtained after high-frequency filtering in both directions. The algorithm for extracting the four feature components (F_LL^T, F_LH^T, F_HL^T, F_HH^T) using the Haar wavelet transform is given in Algorithm 1.
Algorithm 1. Extract features using the Haar wavelet transform
Input: F, LL^T, LH^T, HL^T, HH^T  // F is the input feature map
Output: W  // the four feature components F_LL^T, F_LH^T, F_HL^T, and F_HH^T
K ← [LL^T, LH^T, HL^T, HH^T]  // define the four filtering kernels as a list
F_LL^T, F_LH^T, F_HL^T, F_HH^T ← randomly initialize the four feature components
W ← [F_LL^T, F_LH^T, F_HL^T, F_HH^T]
Step ← 2  // step size (stride) during filtering
for each W_i, K_i in W, K do
  // traverse the position of each element in the output feature
  for each (W_ix, W_iy) in W_i do
    W_i[W_ix][W_iy] ← 0
    // traverse the position of each element in the filtering kernel
    for each (K_ix, K_iy) in K_i do
      S ← F[W_ix * Step + K_ix][W_iy * Step + K_iy] * K_i[K_ix][K_iy]
      W_i[W_ix][W_iy] ← W_i[W_ix][W_iy] + S
    end for
  end for
end for
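For reference, the strided filtering loop of Algorithm 1 can be implemented compactly with tensor slicing. This is a minimal sketch assuming PyTorch tensors of shape (N, C, H, W) with even H and W; the signs follow the L = 1/√2 [1, 1] and H = 1/√2 [−1, 1] filters defined above (orientation conventions may differ slightly between implementations):

```python
def haar_dwt(F):
    """Single-level Haar wavelet transform of a feature map F (N, C, H, W).

    Each non-overlapping 2x2 block is filtered by the LL, LH, HL, and HH
    kernels with stride 2; returns four half-resolution components.
    """
    a = F[:, :, 0::2, 0::2]  # top-left of each 2x2 block
    b = F[:, :, 0::2, 1::2]  # top-right
    c = F[:, :, 1::2, 0::2]  # bottom-left
    d = F[:, :, 1::2, 1::2]  # bottom-right

    F_LL = (a + b + c + d) / 2.0   # low-pass in both directions
    F_LH = (-a - b + c + d) / 2.0  # horizontal low, vertical high
    F_HL = (-a + b - c + d) / 2.0  # horizontal high, vertical low
    F_HH = (a - b - c + d) / 2.0   # high-pass in both directions
    return F_LL, F_LH, F_HL, F_HH
```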
The structure of the HFFE is shown in Figure 6. The F_LL^T component mainly contains low-frequency information, while the F_LH^T, F_HL^T, and F_HH^T components contain high-frequency detail information. Therefore, after performing the wavelet transform on the feature map at the generator's encoding end, we retain only the F_LH^T, F_HL^T, and F_HH^T components. Because the filtering stride of the wavelet transform is 2, the width and height of these three feature maps are halved, so we upsample them back to their original size and add them together element by element. The resulting feature map is then passed through the channel attention (ECANet [24]) network, which emphasizes useful features and suppresses irrelevant ones, further enhancing the high-frequency detail information.
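The following is a minimal sketch of the HFFE, assuming a PyTorch implementation and reusing the haar_dwt helper from the previous sketch; the fixed ECA kernel size of 3 is a simplification of the adaptively sized kernel used in ECANet [24]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: a 1-D conv over pooled channel descriptors."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                    # (N, C) global average pooling
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # 1-D conv across channels
        w = torch.sigmoid(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                              # re-weight channels

class HFFE(nn.Module):
    """High-frequency feature extractor: keep LH/HL/HH, upsample, sum, attend."""
    def __init__(self):
        super().__init__()
        self.eca = ECALayer()

    def forward(self, feat):
        _, f_lh, f_hl, f_hh = haar_dwt(feat)      # discard the low-frequency LL component
        high = sum(F_nn.interpolate(f, size=feat.shape[2:], mode="nearest")
                   for f in (f_lh, f_hl, f_hh))   # restore resolution, add element-wise
        return self.eca(high)                     # emphasize useful high-frequency channels
```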

3.2. Discriminator Network Structure

Our proposed HE-CycleGAN includes four discriminators: D_photo, D_sketch, D_photo-edge, and D_sketch-edge. D_photo and D_sketch are patch discriminators [22] used to determine the authenticity of the generated images. D_photo-edge and D_sketch-edge are our newly designed multi-scale wavelet edge discriminators (MSWEDs).
The MSWED utilizes wavelet transform to extract facial edge information at each scale. Afterwards, MSWED judges the rationality of facial edges based on facial edge information. By utilizing MSWED to encourage the generator to generate more reasonable facial edges, the problem of facial edge overflow can be alleviated.
As shown in Figure 7, the MSWED consists of six downsampling modules, seven wavelet transform modules, seven convolutional layers, and one regression layer. Each wavelet transform module includes a Haar wavelet transform and a 1 × 1 convolution; its structure is shown in the lower box of Figure 7. The wavelet transform extracts facial edge information (F_LH^T, F_HL^T, and F_HH^T) in different directions, and the extracted edge maps are concatenated along the channel dimension. The 1 × 1 convolution is used to increase the number of channels. Of the convolutional layers, the first six each include a 3 × 3 convolution (with a stride of 2), instance normalization, and a LeakyReLU activation function, while the seventh contains only a 3 × 3 convolution. The convolutional layers fuse the facial edge features extracted at different scales, gradually reducing the feature dimensions and expanding the receptive field. The final regression layer uses fully connected layers to produce a score that judges whether the edges of the generated image are reasonable. The output size of each module is shown in Figure 7.
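As an illustration of the MSWED building blocks, here is a sketch assuming a PyTorch implementation; the channel widths and the multi-scale fusion pattern follow Figure 7 and are not specified in the text, so `in_ch` and `out_ch` are placeholders:

```python
import torch
import torch.nn as nn

class WaveletEdgeModule(nn.Module):
    """Wavelet transform module of the MSWED: extract directional edge maps
    (LH, HL, HH), concatenate them along the channel axis, and lift the
    channel dimension with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.expand = nn.Conv2d(3 * in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        _, lh, hl, hh = haar_dwt(x)  # haar_dwt from the earlier sketch
        return self.expand(torch.cat([lh, hl, hh], dim=1))

def conv_stage(in_ch, out_ch, last=False):
    """3x3 stride-2 convolution; the first six stages add IN + LeakyReLU,
    while the seventh uses the bare convolution."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)]
    if not last:
        layers += [nn.InstanceNorm2d(out_ch), nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)
```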

3.3. The Loss Function

Our proposed HE-CycleGAN includes four loss functions: adversarial loss, multi-scale wavelet edge discrimination adversarial loss, cycle-consistency loss, and color-identify loss. As shown in Figure 4, adversarial loss acts on the patch discriminator, multi-scale wavelet edge discrimination adversarial loss acts on the multi-scale wavelet edge discriminator, and cycle-consistency loss and color-identify loss act on the generator.

3.3.1. Adversarial Loss

The generator G_photo converts the facial sketch image into a facial image, and the discriminator D_photo uses Equation (1) to distinguish between the generated facial image G_photo(x) and the real facial image y:

$$\mathcal{L}_{P\text{-}GAN}(G_{photo}, D_{photo}, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[(D_{photo}(y) - 1)^2\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[D_{photo}(G_{photo}(x))^2\big] \tag{1}$$
Similarly, the generator G_sketch converts the facial image into a facial sketch image, and the discriminator D_sketch uses Equation (2) to distinguish between the generated facial sketch image G_sketch(y) and the real facial sketch image x:

$$\mathcal{L}_{P\text{-}GAN}(G_{sketch}, D_{sketch}, X, Y) = \mathbb{E}_{x \sim p_{data}(x)}\big[(D_{sketch}(x) - 1)^2\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[D_{sketch}(G_{sketch}(y))^2\big] \tag{2}$$
The adversarial loss function is defined as Equation (3):

$$\mathcal{L}_{P\text{-}cycleGAN}(G_{photo}, G_{sketch}, D_{photo}, D_{sketch}) = \mathcal{L}_{P\text{-}GAN}(G_{photo}, D_{photo}, X, Y) + \mathcal{L}_{P\text{-}GAN}(G_{sketch}, D_{sketch}, X, Y) \tag{3}$$
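A minimal sketch of this least-squares adversarial loss, assuming a PyTorch implementation (the same form is reused for the edge discriminators in Equations (4)–(6)); detaching the fake image reflects the usual practice of not back-propagating into the generator during the discriminator update:

```python
import torch

def patch_adv_loss(D, real, fake):
    """Least-squares adversarial loss of Equations (1)-(2): the discriminator
    pushes real samples toward 1 and generated samples toward 0."""
    loss_real = torch.mean((D(real) - 1.0) ** 2)
    loss_fake = torch.mean(D(fake.detach()) ** 2)
    return loss_real + loss_fake
```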

3.3.2. Multi-Scale Wavelet Edge Discrimination Adversarial Loss

The discriminator D_photo-edge uses Equation (4) to encourage the generator G_photo to generate facial images with reasonable edges:

$$\mathcal{L}_{E\text{-}GAN}(G_{photo}, D_{photo\text{-}edge}, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[(D_{photo\text{-}edge}(y) - 1)^2\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[D_{photo\text{-}edge}(G_{photo}(x))^2\big] \tag{4}$$
Similarly, the adversarial loss of discriminator D_sketch-edge is defined as Equation (5):

$$\mathcal{L}_{E\text{-}GAN}(G_{sketch}, D_{sketch\text{-}edge}, X, Y) = \mathbb{E}_{x \sim p_{data}(x)}\big[(D_{sketch\text{-}edge}(x) - 1)^2\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[D_{sketch\text{-}edge}(G_{sketch}(y))^2\big] \tag{5}$$
The multi-scale wavelet edge discrimination adversarial loss function is defined as Equation (6):

$$\mathcal{L}_{E\text{-}cycleGAN}(G_{photo}, G_{sketch}, D_{photo\text{-}edge}, D_{sketch\text{-}edge}) = \mathcal{L}_{E\text{-}GAN}(G_{photo}, D_{photo\text{-}edge}, X, Y) + \mathcal{L}_{E\text{-}GAN}(G_{sketch}, D_{sketch\text{-}edge}, X, Y) \tag{6}$$

3.3.3. Cycle-Consistency Loss

The purpose of the cycle-consistency loss function is to ensure that the input facial sketch image x, after passing through G_photo and then G_sketch, yields a reconstruction that remains as consistent as possible with the original image x; that is, x → G_photo(x) → G_sketch(G_photo(x)) ≈ x. Similarly, the facial image y reconstructed after passing through G_sketch and then G_photo should maintain as much content consistency as possible with the original image y; that is, y → G_sketch(y) → G_photo(G_sketch(y)) ≈ y. The cycle-consistency loss function adopts the L1 norm and is defined as Equation (7):

$$\mathcal{L}_{cycle}(G_{photo}, G_{sketch}) = \mathbb{E}_{x \sim p_{data}(x)}\big[\|G_{sketch}(G_{photo}(x)) - x\|_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\|G_{photo}(G_{sketch}(y)) - y\|_1\big] \tag{7}$$
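Equation (7) can be written compactly as follows (a sketch assuming a PyTorch implementation with callable generators):

```python
import torch.nn.functional as F

def cycle_loss(G_photo, G_sketch, x_sketch, y_photo):
    """L1 cycle-consistency loss of Equation (7): each image should survive a
    round trip through both generators."""
    loss_x = F.l1_loss(G_sketch(G_photo(x_sketch)), x_sketch)
    loss_y = F.l1_loss(G_photo(G_sketch(y_photo)), y_photo)
    return loss_x + loss_y
```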

3.3.4. Color-Identify Loss

The purpose of the color-identify loss is to maintain the original color tone during image conversion. When calculating the color-identify loss of generator G_photo, the input to G_photo is the facial image y, and the output should also be that facial image; the color-identify loss of generator G_sketch is obtained analogously with the facial sketch image x. The color-identify loss function adopts the L1 norm and is defined as Equation (8):

$$\mathcal{L}_{color\text{-}id}(G_{photo}, G_{sketch}) = \mathbb{E}_{x \sim p_{data}(x)}\big[\|G_{sketch}(x) - x\|_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\|G_{photo}(y) - y\|_1\big] \tag{8}$$
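Correspondingly, Equation (8) can be sketched as follows (assuming PyTorch, as above):

```python
import torch.nn.functional as F

def color_identity_loss(G_photo, G_sketch, x_sketch, y_photo):
    """L1 color-identify loss of Equation (8): feeding a generator an image
    that is already in its target domain should leave it (and its colors)
    unchanged."""
    return F.l1_loss(G_sketch(x_sketch), x_sketch) + F.l1_loss(G_photo(y_photo), y_photo)
```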

3.3.5. HE-CycleGAN Objective Function

The final loss function of HE-CycleGAN is defined as Equation (9), where α and β are weight coefficients that control the contribution of the different loss terms. Based on experience, we set α and β to 10.0 and 5.0, respectively [1,25].

$$\mathcal{L}_{HE\text{-}CycleGAN}(G_{photo}, G_{sketch}, D_{photo}, D_{sketch}, D_{photo\text{-}edge}, D_{sketch\text{-}edge}) = \mathcal{L}_{P\text{-}cycleGAN}(G_{photo}, G_{sketch}, D_{photo}, D_{sketch}) + \mathcal{L}_{E\text{-}cycleGAN}(G_{photo}, G_{sketch}, D_{photo\text{-}edge}, D_{sketch\text{-}edge}) + \alpha \, \mathcal{L}_{cycle}(G_{photo}, G_{sketch}) + \beta \, \mathcal{L}_{color\text{-}id}(G_{photo}, G_{sketch}) \tag{9}$$
Therefore, the overall optimization objective of the HE-CycleGAN network can be expressed as Equation (10):

$$G_{photo}^{*}, G_{sketch}^{*} = \arg\min_{G_{photo}, G_{sketch}} \; \max_{D_{photo}, D_{sketch}, D_{photo\text{-}edge}, D_{sketch\text{-}edge}} \mathcal{L}_{HE\text{-}CycleGAN}(G_{photo}, G_{sketch}, D_{photo}, D_{sketch}, D_{photo\text{-}edge}, D_{sketch\text{-}edge}) \tag{10}$$
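Putting the pieces together, here is a sketch of the generator-side objective of Equation (9), assuming the helper functions defined above and the standard least-squares generator terms (which push the discriminator outputs toward 1); it is illustrative rather than a verbatim transcription of the authors' training code:

```python
alpha, beta = 10.0, 5.0  # loss weights from Equation (9)

def generator_loss(G_photo, G_sketch, D_photo, D_sketch,
                   D_photo_edge, D_sketch_edge, x_sketch, y_photo):
    """Total generator objective: patch adversarial + edge adversarial
    + weighted cycle-consistency + weighted color-identify losses."""
    y_hat, x_hat = G_photo(x_sketch), G_sketch(y_photo)
    adv = ((D_photo(y_hat) - 1) ** 2).mean() + ((D_sketch(x_hat) - 1) ** 2).mean()
    edge_adv = ((D_photo_edge(y_hat) - 1) ** 2).mean() + ((D_sketch_edge(x_hat) - 1) ** 2).mean()
    cyc = cycle_loss(G_photo, G_sketch, x_sketch, y_photo)
    idt = color_identity_loss(G_photo, G_sketch, x_sketch, y_photo)
    return adv + edge_adv + alpha * cyc + beta * idt
```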

4. Experiments

4.1. Datasets

We evaluated the performance of HE-CycleGAN on the CUHK_student [12,26], XM2VTS [27], and AR [28] datasets from the CUHK Face Sketch Database (CUFS). The CUHK_student dataset contains 188 facial images from the CUHK student database, each with an artist-drawn sketch. The XM2VTS dataset contains 295 facial images, each with a corresponding sketch; its subjects cover multiple races and a wide range of ages. The AR dataset contains 123 facial-image-sketch pairs. The images in all three datasets are 250 × 200 pixels. We selected 100 pairs for training and 88 pairs for testing from the CUHK_student dataset, 195 pairs for training and 100 pairs for testing from the XM2VTS dataset, and 80 pairs for training and 43 pairs for testing from the AR dataset. Table 1 lists the characteristics of the different datasets. To adapt to the input of the network, we resized all images to 256 × 256 and normalized pixel values to [−1, 1].
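A minimal preprocessing sketch matching this description, assuming torchvision (the interpolation mode is left at its default):

```python
from torchvision import transforms

# Resize the 250 x 200 dataset images to 256 x 256 and map pixel values
# from [0, 1] to [-1, 1], matching the generator's Tanh output range.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),                                  # [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),    # -> [-1, 1]
])
```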

4.2. Experimental Procedure

We used the Adam optimizer to update the network weights. The number of training epochs was 500, the batch size was 1, and the learning rate of the generator and discriminator was 0.0002 for the first 100 epochs, gradually decreasing to 0 thereafter. The GPU was an NVIDIA Tesla T4 with 16 GB of memory, the operating system was Ubuntu 18.04, and the CUDA version was 11.8. The proposed model was implemented in Python 3.10.
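The optimizer and learning-rate schedule can be set up as follows (a sketch assuming PyTorch; the Adam betas of (0.5, 0.999) and the linear form of the decay are common CycleGAN settings assumed here rather than stated in the paper):

```python
import torch

def make_optimizer(params, n_epochs=500, n_constant=100, lr=2e-4):
    """Adam with lr = 2e-4 held for the first 100 epochs, then decayed to 0
    over the remaining epochs."""
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        if epoch < n_constant:
            return 1.0
        return 1.0 - (epoch - n_constant) / float(n_epochs - n_constant)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```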
We used three evaluation metrics commonly used in the field of image conversion to evaluate the objective quality of the generated images: Structural Similarity (SSIM) [29], Learned Perceptual Image Patch Similarity (LPIPS) [30], and Fréchet inception distance (FID) [31]. SSIM measures the similarity between generated and real images in terms of luminance, contrast, and structure; its range is [0, 1], and the closer SSIM is to 1, the more similar the generated image is to the real image. LPIPS evaluates the perceptual differences between generated and real images through deep learning models and is more in line with human perception; the lower the LPIPS value, the more similar the two images are. FID measures the distance between the feature vectors of generated and real images; the lower the FID, the more similar the two sets of images are. SSIM, LPIPS, and FID are calculated using Equations (11), (13), and (14), respectively.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{11}$$

In Equation (11), x and y represent the generated image and the real image, μ_x and μ_y represent their means, σ_x² and σ_y² represent their variances, and σ_xy represents their covariance. The constants c_1 and c_2 prevent the denominator from being 0 and are calculated using Equation (12),

$$c_1 = (k_1 L)^2, \quad c_2 = (k_2 L)^2 \tag{12}$$
where k1 = 0.01, k2 = 0.03, and L = 255.
$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \big\| \omega_l \odot \big(\hat{x}^{\,l}_{hw} - \hat{y}^{\,l}_{hw}\big) \big\|_2^2 \tag{13}$$

In Equation (13), x and y are the generated image and the real image, respectively; x̂^l_hw and ŷ^l_hw are the l-th layer features extracted from x and y using VGG19 [32]; ω_l is a scaling parameter; and ‖·‖²₂ is the square of the L2 norm.
$$\mathrm{FID}(x, y) = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\Big(\Sigma_x + \Sigma_y - 2\big(\Sigma_x \Sigma_y\big)^{\frac{1}{2}}\Big) \tag{14}$$

In Equation (14), x and y represent the generated images and the real images, respectively, and Inception V3 [33] is used to extract their feature vectors. μ_x and μ_y are the means of the feature vectors of the generated and real images, Σ_x and Σ_y are the covariance matrices of their features, Tr denotes the trace of a matrix, and ‖·‖²₂ is the square of the L2 norm.
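For example, given the Inception V3 feature statistics, Equation (14) can be evaluated as follows (a sketch using NumPy and SciPy):

```python
import numpy as np
from scipy import linalg

def fid(mu_x, sigma_x, mu_y, sigma_y):
    """FID of Equation (14) from Inception feature statistics: mu_* are mean
    vectors, sigma_* are covariance matrices of generated and real features."""
    diff = mu_x - mu_y
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_y, disp=False)  # matrix square root
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_x + sigma_y - 2.0 * covmean))
```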

4.3. Result Analysis

We conducted comparative experiments between the proposed method and existing methods: Pix2Pix [5], CycleGAN [1], CSGAN [2], CDGAN [3], and LSCIT [4]. Figure 8, Figure 9 and Figure 10 show the facial images generated by different methods on three datasets, respectively.
As shown in the first row of Figure 8, the result generated by Pix2Pix shows very blurry hair strands in the bangs area. In contrast, CycleGAN, CSGAN, CDGAN, and LSCIT can generate hair strands in the bangs area, but the generated strands are relatively sparse, and the eyebrows generated by these methods are also relatively sparse. The hair strands in the bangs of the image generated by the proposed HE-CycleGAN are closer to those of the sketch image, and the eyebrows are also more similar to the sketch image. In the second row of Figure 8 (Figure 9 shows the enlarged results of these images), the result generated by Pix2Pix does not include the double eyelids present in the sketch image. Although CycleGAN, CSGAN, CDGAN, and LSCIT can generate double eyelids, they are relatively blurry; the image generated by HE-CycleGAN has the clearest double eyelids.
As shown in the first and second rows of Figure 10, Pix2Pix does not generate hair strands at the position of the red circle. CycleGAN, CSGAN, CDGAN, and LSCIT can generate hair strands there, but the generated strands are relatively blurry. The proposed HE-CycleGAN generates hair strands that are closer to the sketch images and are clearer. As shown in the first and second rows of Figure 11, the hair generated by the comparison methods is relatively sparse, whereas the hair generated by the proposed HE-CycleGAN is closer to that in the sketch image. In the third row (Figure 12 shows its enlarged result), the double eyelids generated by Pix2Pix, CycleGAN, CSGAN, CDGAN, and LSCIT are relatively blurry, while the proposed HE-CycleGAN generates images with clearer double eyelids. As shown in the red circle in Figure 11, there are artifacts in the faces generated by CycleGAN and LSCIT.
In rows three and four of Figure 8 and rows three, four, and five of Figure 10, the facial edges generated by Pix2Pix are relatively reasonable, but the images are very blurry overall. The faces generated by CycleGAN, CSGAN, CDGAN, and LSCIT show significant improvements, but facial edge overflow occurs at the position of the red box. The facial edges generated by the proposed HE-CycleGAN are more reasonable.
In summary, the facial images generated by our proposed HE-CycleGAN have clearer details and more reasonable edges, and they visually better match the content expressed by the facial sketch images.
Table 2, Table 3 and Table 4 show the quantitative comparison results of the different methods on the CUHK_student, XM2VTS, and AR datasets, respectively. Our proposed HE-CycleGAN shows improvements in all three metrics: Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Fréchet inception distance (FID).

4.4. Ablation Studies

Table 5, Table 6 and Table 7 show the results of the ablation studies conducted on the three datasets, with the original CycleGAN [1] as the baseline. As seen from the results, HE-CycleGAN achieved the best SSIM, LPIPS, and FID values. HE-CycleGAN with the HFFE removed and HE-CycleGAN with the MSWED removed both performed worse than the full HE-CycleGAN but better than the original CycleGAN. These results demonstrate the effectiveness of the HFFE and the MSWED in HE-CycleGAN.
To verify the effectiveness of the channel attention (ECANet [24]) used in the HFFE module, we added HFFE modules with and without ECANet to CycleGAN [1] for ablation studies. Table 8, Table 9 and Table 10 show the results on the CUHK_student, XM2VTS, and AR datasets. From these results, it can be seen that introducing ECANet into the HFFE module is effective and improves the performance of the model.

5. Conclusions

In this paper, we proposed HE-CycleGAN, a new network for converting facial sketch images to facial images. We added a high-frequency feature extractor (HFFE) to the generator of HE-CycleGAN, which retains more detailed features, and we designed a multi-scale wavelet edge discriminator (MSWED), which better constrains facial edges and avoids the problem of facial edge overflow. The experimental results on the CUHK_student, XM2VTS, and AR datasets show that the SSIM value of the proposed HE-CycleGAN is about 2% higher than that of the state-of-the-art method (LSCIT), while its LPIPS and FID values are about 2% and 16% lower, respectively. The proposed HE-CycleGAN has potential applications such as converting facial photos into sketch images or, after training with images of other styles, converting facial images into cartoons or other styles. Because the wavelet transform provides additional detail features to the HE-CycleGAN generator, we speculate that, when facing a large number of generation demands, a very small number of the generated facial images may exhibit excessive detail, i.e., excessive hair and beard details. In the next step of our research, we will first test the proposed HE-CycleGAN on a large number of real images; if excessive detailing occurs, we will investigate how to adaptively adjust the amount of detail features provided to the generator to avoid it.

Author Contributions

Conceptualization, methodology, B.L.; software, validation, visualization, R.D.; writing—original draft preparation, writing—review and editing, R.D., J.L., and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the research project of the Jilin Provincial Department of Education under Grant no. JJKH20240153KJ.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This article is not about human research.

Data Availability Statement

The data presented in this study are openly available; reference number [14,21,22,23].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhu, J.Y.; Park, T.; Isola, P. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
2. Babu, K.K.; Dubey, S.R. CSGAN: Cyclic-synthesized generative adversarial networks for image-to-image transformation. Expert Syst. Appl. 2021, 169, 114431.
3. Babu, K.K.; Dubey, S.R. CDGAN: Cyclic discriminative generative adversarial networks for image-to-image transformation. J. Vis. Commun. Image Represent. 2022, 82, 103382.
4. Wang, G.; Shi, H.; Chen, Y.; Wu, B. Unsupervised image-to-image translation via long-short cycle-consistent adversarial networks. Appl. Intell. 2023, 53, 17243–17259.
5. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
6. Senapati, R.K.; Satvika, R.; Anmandla, A.; Ashesh Reddy, G.; Anil Kumar, C. Image-to-image translation using Pix2Pix GAN and cycle GAN. In International Conference on Data Intelligence and Cognitive Informatics; Springer Nature Singapore: Singapore, 2023; pp. 573–586.
7. Zhang, Y.; Yu, L.; Sun, B.; He, J. ENG-Face: Cross-domain heterogeneous face synthesis with enhanced asymmetric CycleGAN. Appl. Intell. 2022, 52, 15295–15307.
8. Chu, C.; Zhmoginov, A.; Sandler, M. CycleGAN, a master of steganography. arXiv 2017, arXiv:1712.02950.
9. Porav, H.; Musat, V.; Newman, P. Reducing steganography in cycle-consistency GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 78–82.
10. Gao, Y.; Wei, F.; Bao, J.; Gu, S.; Chen, D.; Wen, F.; Lian, Z. High-fidelity and arbitrary face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 16115–16124.
11. Lin, C.T.; Kew, J.L.; Chan, C.S.; Lai, S.H.; Zach, C. Cycle-object consistency for image-to-image domain adaptation. Pattern Recognit. 2023, 138, 109416.
12. Wang, X.; Tang, X. Face photo-sketch synthesis and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 1955–1967.
13. Xiao, B.; Gao, X.; Tao, D.; Li, X. A new approach for face recognition by sketches in photos. Signal Process. 2009, 89, 1576–1588.
14. Bono, F.M.; Radicioni, L.; Cinquemani, S.; Conese, C.; Tarabini, M. Development of soft sensors based on neural networks for detection of anomaly working condition in automated machinery. In Proceedings of the NDE 4.0, Predictive Maintenance, and Communication and Energy Systems in a Globally Networked World, Long Beach, CA, USA, 4–10 April 2022; pp. 56–70.
15. Zhang, L.; Lin, L.; Wu, X.; Ding, S.; Zhang, L. End-to-end photo-sketch generation via fully convolutional representation learning. In Proceedings of the 5th ACM International Conference on Multimedia Retrieval, Shanghai, China, 23–26 June 2015.
16. Zhou, G.; Fan, Y.; Shi, J.; Lu, Y.; Shen, J. Conditional generative adversarial networks for domain transfer: A survey. Appl. Sci. 2022, 12, 8350.
17. Porkodi, S.P.; Sarada, V.; Maik, V.; Gurushankar, K. Generic image application using GANs (generative adversarial networks): A review. Evol. Syst. 2023, 14, 903–917.
18. Li, Y.; Chen, X.; Wu, F.; Zha, Z.J. LinesToFacePhoto: Face photo generation from lines with conditional self-attention generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2323–2331.
19. Chen, S.Y.; Su, W.; Gao, L.; Xia, S.; Fu, H. Deep generation of face images from sketches. arXiv 2020, arXiv:2006.01047.
20. Li, L.; Tang, J.; Shao, Z.; Tan, X.; Ma, L. Sketch-to-photo face generation based on semantic consistency preserving and similar connected component refinement. Vis. Comput. 2022, 38, 3577–3594.
21. Sun, J.; Yu, H.; Zhang, J.J.; Dong, J.; Yu, H.; Zhong, G. Face image-sketch synthesis via generative adversarial fusion. Neural Netw. 2022, 154, 179–189.
22. Shao, X.; Qiang, Z.; Dai, F.; He, L.; Lin, H. Face image completion based on GAN prior. Electronics 2022, 11, 1997.
23. Ren, G.; Geng, W.; Guan, P.; Cao, Z.; Yu, J. Pixel-wise grasp detection via twin deconvolution and multi-dimensional attention. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4002–4010.
24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
25. Gao, G.; Lai, H.; Jia, Z. Unsupervised image dedusting via a cycle-consistent generative adversarial network. Remote Sens. 2023, 15, 1311.
26. Zhang, W.; Wang, X.; Tang, X. Coupled information-theoretic encoding for face photo-sketch recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 20–25 June 2011.
27. Koch, B.; Grbić, R. One-shot lip-based biometric authentication: Extending behavioral features with authentication phrase information. Image Vis. Comput. 2024, 142, 104900.
28. Liu, F.; Chen, D.; Wang, F.; Li, Z.; Xu, F. Deep learning based single sample face recognition: A survey. Artif. Intell. Rev. 2023, 56, 2723–2748.
29. Rajeswari, G.; Ithaya Rani, P. Face occlusion removal for face recognition using the related face by structural similarity index measure and principal component analysis. J. Intell. Fuzzy Syst. 2022, 42, 5335–5350.
30. Ko, K.; Yeom, T.; Lee, M. SuperstarGAN: Generative adversarial networks for image-to-image translation in large-scale domains. Neural Netw. 2023, 162, 330–339.
31. Kynkäänniemi, T.; Karras, T.; Aittala, M.; Aila, T.; Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. arXiv 2022, arXiv:2203.06026.
32. Song, Z.; Zhang, Z.; Fang, F.; Fan, Z.; Lu, J. Deep semantic-aware remote sensing image deblurring. Signal Process. 2023, 211, 109108.
33. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
Figure 1. Edge overflow phenomenon in existing methods for generating facial images.
Figure 2. Results generated by Pix2Pix [5] and CycleGAN [1] trained with non-aligned data.
Figure 3. Results generated by CycleGAN [1]: (a) input facial sketch image; (b) the generated facial image; (c) image reconstructed using (b).
Figure 4. The architecture of HE-CycleGAN.
Figure 5. The architecture of the generator.
Figure 6. The structure of the HFFE.
Figure 7. The architecture of the MSWED.
Figure 8. Qualitative comparison results of different methods on the CUHK_student dataset.
Figure 9. Enlarged results of different comparison methods in the second row of Figure 8.
Figure 10. Qualitative comparison results of different methods on the XM2VTS dataset.
Figure 11. Qualitative comparison results of different methods on the AR dataset.
Figure 12. Enlarged results of different comparison methods in the third row of Figure 11.
Table 1. Characteristics of different datasets.

| Datasets | Number of Sample Pairs | Size | Train/Test |
|---|---|---|---|
| CUHK_student | 188 | 250 × 200 | 100/88 |
| XM2VTS | 295 | 250 × 200 | 195/100 |
| AR | 123 | 250 × 200 | 80/43 |
Table 2. Comparison results of different methods on the CUHK_student dataset.

| Metric | Pix2Pix [5] | CycleGAN [1] | CSGAN [2] | CDGAN [3] | LSCIT [4] | Ours |
|---|---|---|---|---|---|---|
| SSIM ⬆ | 0.6866 | 0.6938 | 0.6837 | 0.6916 | 0.6944 | 0.7118 |
| LPIPS ⬇ | 0.2756 | 0.2300 | 0.2524 | 0.2270 | 0.2281 | 0.2017 |
| FID ⬇ | 127.6600 | 65.5658 | 87.0142 | 60.4071 | 67.2641 | 51.4870 |
Table 3. Comparison results of different methods on the XM2VTS dataset.

| Metric | Pix2Pix [5] | CycleGAN [1] | CSGAN [2] | CDGAN [3] | LSCIT [4] | Ours |
|---|---|---|---|---|---|---|
| SSIM ⬆ | 0.5834 | 0.5940 | 0.5984 | 0.5967 | 0.6057 | 0.6109 |
| LPIPS ⬇ | 0.2481 | 0.2371 | 0.2426 | 0.2453 | 0.2452 | 0.2207 |
| FID ⬇ | 66.5135 | 47.1245 | 58.0198 | 47.1513 | 50.8334 | 41.2961 |
Table 4. Comparison results of different methods on the AR dataset.

| Metric | Pix2Pix [5] | CycleGAN [1] | CSGAN [2] | CDGAN [3] | LSCIT [4] | Ours |
|---|---|---|---|---|---|---|
| SSIM ⬆ | 0.6836 | 0.6801 | 0.6930 | 0.6830 | 0.6816 | 0.7048 |
| LPIPS ⬇ | 0.2423 | 0.2596 | 0.2276 | 0.2585 | 0.2529 | 0.2128 |
| FID ⬇ | 99.0394 | 92.0436 | 77.0084 | 77.5533 | 74.7337 | 51.8288 |
Table 5. The results of ablation studies on the CUHK_student dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| CycleGAN [1] | 0.6938 | 0.2300 | 65.5658 |
| +HFFE | 0.7034 | 0.2080 | 54.0132 |
| +MSWED | 0.7085 | 0.2107 | 56.7533 |
| HE-CycleGAN | 0.7118 | 0.2017 | 51.4870 |
Table 6. The results of ablation studies on the XM2VTS dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| CycleGAN [1] | 0.5940 | 0.2371 | 47.1245 |
| +HFFE | 0.6004 | 0.2288 | 43.3593 |
| +MSWED | 0.6031 | 0.2250 | 41.9002 |
| HE-CycleGAN | 0.6109 | 0.2207 | 41.2961 |
Table 7. The results of ablation studies on the AR dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| CycleGAN [1] | 0.6801 | 0.2596 | 92.0436 |
| +HFFE | 0.7029 | 0.2169 | 52.6087 |
| +MSWED | 0.6978 | 0.2240 | 64.8579 |
| HE-CycleGAN | 0.7048 | 0.2128 | 51.8288 |
Table 8. The results of verifying the validity of ECANet on the CUHK_student dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| +HFFE | 0.7034 | 0.2080 | 54.0132 |
| HFFE − ECANet [24] | 0.6969 | 0.2115 | 56.8686 |
Table 9. The results of validating the effectiveness of ECANet on the XM2VTS dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| +HFFE | 0.6004 | 0.2288 | 43.3593 |
| HFFE − ECANet [24] | 0.5978 | 0.2340 | 44.9255 |
Table 10. The results of validating the effectiveness of ECANet on the AR dataset.

| Method | SSIM ⬆ | LPIPS ⬇ | FID ⬇ |
|---|---|---|---|
| +HFFE | 0.7029 | 0.2169 | 52.6087 |
| HFFE − ECANet [24] | 0.6971 | 0.2203 | 56.2123 |