1. Introduction
In the ever-evolving landscape of modern computer vision, image generation has emerged as a prominent and extensively explored research domain. Drawing on the capabilities of artificial intelligence and machine learning, the field is dedicated to producing images that are both realistic and rich in diverse features. The pervasive influence of Japanese anime across many domains has sparked strong interest in anime face generation, which has carved out a distinctive niche within image generation and attracted substantial attention from both academia and industry.
The realm of machine learning and neural networks has witnessed unprecedented growth, with applications ranging from asset allocation [1] to non-negative matrix factorization [2]. These technologies have found utility in diverse domains, from natural language processing to medical diagnostics. Their impact has been especially pronounced in image generation, where they play a pivotal role in creating images with lifelike qualities and intricate features.
With the ongoing advancements in deep learning, particularly the emergence of generative adversarial networks (GANs) [3], significant progress has been achieved in image generation. Among the various GAN architectures, deep convolutional GANs (DCGANs) [4], built on deep convolutional networks [5], have demonstrated exceptional image-processing capabilities and can generate high-quality images. Despite these achievements, DCGANs still face certain limitations: because of their local receptive fields, modeling long-range dependencies requires passing through several convolutional layers.
To address these concerns and further enhance the quality of image generation, self-attention GANs (SAGANs) [6] were developed. SAGANs incorporate the self-attention mechanism [7], enabling them to effectively capture long-range dependencies while retaining the advantages of DCGANs. This fusion of self-attention with the DCGAN architecture has shown promising potential across image generation tasks. Nevertheless, when applied specifically to anime face generation, SAGANs encounter certain challenges.
Notably, the generated anime images still exhibit shortcomings such as blurry edges and distorted facial features, indicating the need for further improvements in image quality. Additionally, although SAGANs include techniques to mitigate mode collapse, a well-known problem in GAN training in which the model collapses to generating a limited variety of samples, there is still room for improvement in ensuring stable and diverse generation results. Furthermore, SAGAN performance tends to deteriorate during extended training. One possible explanation is catastrophic forgetting, a sharp decline in a model's performance on previously learned tasks as soon as a new task is introduced.
To generate more detailed anime images, researchers developed CartoonGAN [8]. This model constructs its own blurry dataset using digital image processing (DIP) methods, namely the Canny edge-detection algorithm and Gaussian blur, and combines the traditional loss function with an edge-promoting adversarial loss that discourages the model from generating images with blurry edges. AnimeGAN [9] employs additional techniques to create fake datasets containing a broader range of negative samples. However, these anime-related works focus primarily on image style transfer [10], and powerful methods for generating anime faces from random noise are lacking. To tackle mode collapse and catastrophic forgetting, the closed-loop memory GAN [11] introduces a memory structure that helps the model learn from past data, thereby improving the variety of generated pictures.
In this work, we take a two-fold approach to the challenges of anime face generation. First, we introduce a regularization parameter into the optimizer of the SAGAN to mitigate mode collapse. Second, we present the SAGAN with blur and memory, abbreviated as BaMSGAN. This model incorporates both a blur dataset and a memory structure into the self-attention GAN framework. The self-attention structure enables the model to capture features from all positions and learn long-range dependencies. The blurry dataset, randomly sampled from the original dataset and processed with digital image processing (DIP) techniques, is used as negative data in the discriminator to discourage the generation of images with blurry edges. The memory structure stores previously generated images, which help the discriminator learn from the model's historical output. Extensive experiments were conducted on an anime face dataset [12] comprising 63,566 images at a resolution of 64 × 64 pixels. The results confirm the effectiveness of the proposed model: BaMSGAN significantly outperforms previous methods in anime face generation, producing more detailed images at a faster pace, with a noticeable reduction in distortion rate after each epoch.
3. BaMSGAN
A typical generative adversarial network comprises two convolutional neural networks: the discriminator $D$ and the generator $G$. The generator produces fake images from random noise, while the discriminator classifies whether an input image comes from the real dataset or was produced by the generator (i.e., real or fake). As training progresses, the discriminator $D$ becomes more proficient at distinguishing real from fake images, while the generator $G$ improves its ability to deceive the discriminator by producing high-quality images. Our final goal is thus to solve the minimax problem
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. $$
Several earlier models, such as CartoonGAN and AnimeGAN, have achieved great success in style transfer tasks; however, generating anime images from noise, and anime face generation in particular, has been largely overlooked. Building on self-attention generative adversarial networks, we adapt our model to these tasks. We formulate a generating function that draws on the original anime manifold and the edge-blurred anime manifold to produce fake anime images in the same style; the function is trained using the original anime dataset and the edge-blurred anime dataset. In the following subsections, we present in detail the four key components and the overall structure of our architecture.
3.1. SAGAN with Regularization
SAGAN, or self-attention generative adversarial network, is an advanced architecture within the realm of generative adversarial networks. It introduces self-attention mechanisms to enhance the consistency and quality of fine details in generated images. By employing self-attention, SAGAN gains the ability to grasp global structures and dependencies within images during the generation process, resulting in the production of more lifelike and nuanced images. Given these capabilities, we opted to apply SAGAN to the task of anime face generation.
However, our initial application of the standard SAGAN architecture to anime face generation revealed a pressing issue: mode collapse. The generator began producing strikingly similar images, sharply reducing the diversity of the generated samples. In response, we regularized the optimization process to prevent overfitting and restore variety to the generated anime faces.
Through numerous iterations and experiments, we arrived at a regularization parameter of 0.0001 as an effective setting. This regularization procedure is a critical component of image generation in our model. Below, we show anime face samples before and after the application of regularization.
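As a concrete illustration, such regularization can be introduced directly through the optimizer. The snippet below is a minimal sketch assuming a PyTorch Adam optimizer with our 0.0001 coefficient as weight decay; the learning rates and betas shown follow common SAGAN practice and are not values reported here.

```python
import torch

# Hypothetical generator/discriminator instances (architectures omitted).
G, D = Generator(), Discriminator()

# Weight decay (L2 regularization) of 1e-4, as tuned in our experiments.
# Learning rates and betas are assumed values, typical for SAGAN training.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9),
                         weight_decay=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9),
                         weight_decay=1e-4)
```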
Mode collapse, a recurrent issue in GANs, occurs when the generator becomes fixated on a restricted set of patterns or samples, limiting the diversity of the generated images. Regularized optimization techniques [29], such as adding regularization terms to the loss function [30], offer a means to counteract it. Methods like weight decay (L2 regularization) or gradient penalties encourage smoother behavior in both the generator and discriminator networks, fostering a more stable training process, better convergence, and a marked reduction in mode collapse. Samples from the original SAGAN (exhibiting mode collapse) and samples after regularization are shown in Figure 1.
3.2. Edge Blur
To enhance image clarity, we drew inspiration from the architecture of CartoonGAN, a GAN that transforms real photographs into cartoon-style images. Our strategy was to blur the edges of input images and integrate the result into our loss function: during training, the discriminator learns to classify these edge-blurred images as 'fake', encouraging the generator to produce sharper images. This approach proved highly effective.
In the preprocessing phase, we selected Gaussian blur in conjunction with Canny edge detection to address the challenge of generating more distinct anime pictures and improving overall image clarity. A subset of the training dataset underwent a blurring preprocessing step. This step enabled the discriminator to better recognize edge-blurred images as “fake”, ultimately instructing the generator to create clearer images.
To create a database of edge-blurred images for training, we used Canny edge detection to outline image edges and applied Gaussian blur to them. This database was then used to train the discriminator, compelling the model to avoid generating images with blurry edges and yielding anime faces with significantly improved clarity. The Edge Blur procedure is given as pseudocode in Algorithm 1, and an example is shown in Figure 2a,b.
Algorithm 1 Edge Blur
Input: image I, Canny thresholds t1, t2, Gaussian kernel size k
Output: edge-blurred image I
1: E ⟵ copy of I
2: edges ⟵ Canny(I, t1, t2) {use Canny to get the outline of the edges}
3: for each row i do
4:   for each column j do
5:     if edges[i, j] = 0 then
6:       E[i, j] ⟵ [0, 0, 0] {keep only edge pixels in E}
7:     end if
8:   end for
9: end for
10: B ⟵ GaussianBlur(E, k) {apply Gaussian blur to the edge outlines obtained previously; k is the kernel size}
11: for each row i do
12:   for each column j do
13:     if edges[i, j] ≠ 0 then
14:       I[i, j] ⟵ B[i, j] {replace the original edge pixels with the Gaussian-blurred version}
15:     end if
16:   end for
17: end for
18: return I
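For concreteness, the following is a minimal OpenCV rendering of Algorithm 1. The function name, default thresholds, and kernel size are illustrative choices of ours, not values reported in this paper.

```python
import cv2
import numpy as np

def edge_blur(img: np.ndarray, t1: int = 100, t2: int = 200, k: int = 5) -> np.ndarray:
    """Blur only the edge regions of an image (a sketch of Algorithm 1)."""
    # Step 2: Canny edge detection yields a binary edge map.
    edges = cv2.Canny(img, t1, t2)
    # Steps 3-9: keep only edge pixels, blacking out everything else.
    edge_only = img.copy()
    edge_only[edges == 0] = 0
    # Step 10: Gaussian-blur the isolated edge image.
    blurred = cv2.GaussianBlur(edge_only, (k, k), 0)
    # Steps 11-17: replace the original edge pixels with their blurred version.
    out = img.copy()
    out[edges != 0] = blurred[edges != 0]
    return out
```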
3.3. Memory
We established a memory repository with a predefined capacity to keep storage efficient. In the GAN training process, we first allow the GAN to train until it approaches convergence, because images generated at the outset of training tend to be of poor quality: they lack the quality needed to meaningfully challenge the discriminator, and storing them consumes valuable time and resources. Only once the GAN can generate high-quality images does the memory repository begin storing a selection of these generator-produced images.
When the memory repository reaches its maximum capacity, we adopt a policy of random sampling and deletion to manage its contents. This periodic removal of images ensures that the memory repository remains a dynamic and adaptive source of historical data. The images stored in this memory repository are subsequently utilized as fake data in the training process, provided to the discriminator. In doing so, we integrate historical data into the training of the GAN, effectively improving its overall performance.
This memory repository approach helps address issues related to mode collapse and catastrophic forgetting while also contributing to the training of the GAN using a more diverse and comprehensive dataset. Additionally, it allows for the generation of anime faces with improved clarity and fidelity.
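A minimal sketch of such a repository is given below, assuming PyTorch tensors. The fixed capacity and random-deletion policy follow the description above; the class name, interface, and default capacity are our own illustration.

```python
import random
import torch

class MemoryRepository:
    """Fixed-capacity store of past generator outputs.

    Sampling assumes the repository is non-empty, i.e., that the
    warm-up phase described above has already passed.
    """

    def __init__(self, capacity: int = 2048):
        self.capacity = capacity
        self.images: list[torch.Tensor] = []

    def push(self, batch: torch.Tensor) -> None:
        # Detach so stored images do not keep the generator's graph alive.
        for img in batch.detach().cpu():
            if len(self.images) >= self.capacity:
                # Random deletion keeps the repository's contents dynamic.
                self.images.pop(random.randrange(len(self.images)))
            self.images.append(img)

    def sample(self, n: int, device: torch.device) -> torch.Tensor:
        # Draw historical fakes to present to the discriminator.
        picks = random.sample(self.images, min(n, len(self.images)))
        return torch.stack(picks).to(device)
```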
3.4. Loss Function
Our loss function comprises three key components. The first is the adversarial loss, standard in generative adversarial networks: the generator $G$ produces anime avatars from random noise $z$ drawn from a Gaussian distribution, aiming to deceive the discriminator $D$, which in turn must accurately distinguish avatars in the real anime face dataset from those created by the generator.
The second component is the edge loss. We prepare an edge-blurred dataset $\mathcal{E} = \{e_i\}_{i=1}^{M}$ in advance, where $M$ is the total number of blurred images. These blurred images are used as fake inputs, pushing our GAN to place a stronger emphasis on the edge features that are particularly salient in anime avatars.
The third component is the memory loss. We augment training by storing a portion of the generator's outputs in a memory dataset $\mathcal{R}$ and reusing them as fake inputs. This dataset captures historical outputs, allowing our GAN to learn from earlier stages of training and avoid deteriorating over extended training periods.
Our loss function is formulated as follows:
$$ \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] + \mathbb{E}_{e \sim \mathcal{E}}[\log(1 - D(e))] + \mathbb{E}_{m \sim \mathcal{R}}[\log(1 - D(m))]. $$
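One possible PyTorch rendering of the discriminator side of this objective is sketched below; the BCE-with-logits form and the function signature are our assumptions (the original exposition does not specify a vanilla versus hinge formulation).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, blurred, memory):
    """Three-part discriminator loss: real images are positives;
    generated, edge-blurred, and memory images are all negatives."""
    ones = torch.ones(real.size(0), 1, device=real.device)
    loss_real = F.binary_cross_entropy_with_logits(D(real), ones)
    loss_fake = sum(
        F.binary_cross_entropy_with_logits(
            D(x), torch.zeros(x.size(0), 1, device=x.device))
        for x in (fake.detach(), blurred, memory)  # detach: no grad to G here
    )
    return loss_real + loss_fake
```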
3.5. Our Model
In our model, we harness the power of self-attention mechanisms, a cornerstone of the BaMSGAN, to significantly enhance the generation of anime faces. Self-attention introduces a critical element to our architecture, enabling the model to capture long-range dependencies and intricate spatial relationships within the input images. By implementing self-attention, our BaMSGAN can simultaneously focus on different regions of the image, effectively capturing the global contextual information. This capability empowers both the generator and discriminator to discern complex structures and connections within anime faces, a task often challenging for traditional convolutional architectures.
The integration of self-attention is pivotal for improving the quality of our generated anime faces: it helps the model synthesize anime features more cohesively, yielding sharper and more realistic images, and it strengthens the discriminator's ability to distinguish real from generated images, facilitating a more robust and stable training process. In this way, self-attention addresses the limitations of earlier GAN architectures and plays a central role in our model's performance in anime face generation.
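A standard SAGAN-style self-attention block, as it might appear in our generator and discriminator, is sketched below in PyTorch; the channel-reduction factor of 8 follows common SAGAN practice [6] and is an assumption here.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """SAGAN-style self-attention over convolutional feature maps."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.query(x).view(n, -1, h * w).permute(0, 2, 1)  # N, HW, C//8
        k = self.key(x).view(n, -1, h * w)                     # N, C//8, HW
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # N, HW, HW
        v = self.value(x).view(n, -1, h * w)                   # N, C, HW
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(n, c, h, w)
        return self.gamma * out + x  # residual: starts as the identity map
```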
As for our training approach, we first extract a portion of the samples from the original dataset and apply edge-blur processing to create the blur dataset. During training, images from the original dataset serve as positive examples, while generated images and images from the blur dataset serve as negative examples. Once training begins to converge, a selection of generated images is preserved in the memory dataset.
During the training process, we continually sample, and randomly remove, memory data; the sampled images serve as additional negative examples that further enhance the training of the BaMSGAN. The whole training process and model architecture are shown in Figure 3.
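Putting the pieces together, one training step might look like the following sketch, which reuses the discriminator_loss and MemoryRepository helpers sketched above; all names and the latent size are our illustrative choices.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real, blur_iter, memory, z_dim=128):
    """One BaMSGAN-style update: real images are positives; generated,
    edge-blurred, and memory images are negatives for the discriminator."""
    device = real.device
    fake = G(torch.randn(real.size(0), z_dim, device=device))
    blurred = next(blur_iter).to(device)       # batch from the blur dataset
    mem = memory.sample(real.size(0), device)  # historical fakes

    # Discriminator update against one real and three fake sources.
    opt_D.zero_grad()
    d_loss = discriminator_loss(D, real, fake, blurred, mem)
    d_loss.backward()
    opt_D.step()

    # Generator update: try to make D label the fakes as real.
    opt_G.zero_grad()
    ones = torch.ones(real.size(0), 1, device=device)
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    g_loss.backward()
    opt_G.step()

    memory.push(fake)  # archive outputs once training has stabilized
    return d_loss.item(), g_loss.item()
```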
4. Experiments
We implemented and trained our model in Python with PyTorch. All experiments were conducted on an NVIDIA RTX 4090 GPU. The BaMSGAN generates high-quality anime images without demanding extensive computing resources or time; with less than an hour of training even on an RTX 4060, it can produce relatively high-quality anime images. Results generated by the BaMSGAN are displayed in Figure 4.
Our baselines are the Wasserstein GAN with gradient penalty (WGAN-GP) and the DCGAN. We first compare the BaMSGAN with the WGAN-GP and DCGAN to demonstrate its performance. We then conduct ablation experiments, comparing a SAGAN with regularization, a SAGAN with regularization and edge-blurred images, and the full BaMSGAN, to assess the effectiveness of each component of our loss function.
During training, we sampled outputs from each model at both 30 epochs and 300 epochs; the 30-epoch results represent short-term performance, while the 300-epoch results reflect long-term performance.
4.1. Data
The dataset we use contains approximately 63,565 anime face images obtained from GitHub [12]. All images have been resized to 64 × 64 pixels. The blurred database is one-tenth the size of the dataset, comprising roughly 6000 anime faces with blurred edges.
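A typical loading pipeline for this data might look as follows; the local path, batch size, and normalization are illustrative assumptions of ours.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),                 # dataset images are 64 x 64
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # scale pixels to [-1, 1]
])
# Hypothetical local directory holding the GitHub dataset [12].
dataset = datasets.ImageFolder("data/anime_faces", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
```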
4.2. Comparison with Prior Work in Anime Face Generation
To confirm the BaMSGAN's superiority in anime face generation, we compare the post-training outputs of the BaMSGAN, DCGAN, and WGAN-GP; the results are showcased in Figure 5. In contrast to the DCGAN, our model significantly enhances the diversity of generated anime faces. The DCGAN, in both the short run and the long run, suffers from severe mode collapse, producing nearly identical, low-quality faces. Our model, by contrast, produces clear faces with diverse features such as hair color and facial expression, creating more vivid characters.
Compared to the WGAN-GP, which generates severely distorted and blurred faces, our model produces anime faces with well-defined edges (evident in the hair curves) and natural eyes, noses, and mouths. The inclusion of edge-blurred images in the loss function sharpens the edges, enhancing the clarity of anime faces. Additionally, the memory loss in the loss function contributes to normalizing facial features, as the discriminator learns to recognize the twisted faces generated during earlier epochs as fake images.
Moreover, we compare the distortion rate (DR: the number of distorted faces divided by the total number of faces generated in each epoch) between our model and the SAGAN from epoch 290 to epoch 298. Compared to the SAGAN, our model effectively improves image quality even in the later stages of training and exhibits the ability to learn continuously.
These results underscore the effectiveness of BaMSGAN in overcoming the limitations observed in other GAN architectures, producing anime faces with greater clarity and fidelity.
Table 1 presents a comparison of the distortion rate (DR) between our BaMSGAN model and the SAGAN baseline at epochs 290 to 298 during the experiment.
4.3. Ablation Experiments on the Loss Function
To understand the function of each component of the loss, we conducted a series of ablation experiments: starting from the original GAN loss function (already incorporating regularization), we introduced the edge-blurred image loss and the memory loss one at a time and compared the results with the full BaMSGAN. The outcomes are visualized in Figure 6.
The experimental results clearly demonstrate that the BaMSGAN consistently achieves superior performance in both short-term and long-term training, producing anime avatars that closely resemble real characters. In the edge-only experiment, short-term image quality is significantly lower than that of the BaMSGAN, and quality deteriorates noticeably over prolonged training. In the memory-only experiment, without the additional supervision provided by the edge-blur loss, the model struggles to generate higher-quality images, with some images appearing incomplete or blurred.
These ablation experiments highlight the indispensable role played by each component in our loss function, emphasizing their collective contribution to the BaMSGAN’s remarkable performance in anime face generation.
5. Conclusions and Future Work
In this paper, we employ a SAGAN enhanced with historical information and edge blurring for anime face generation. Our model adds regularization to the original SAGAN to stabilize training. We selected a subset of images from the source dataset, applied Canny edge detection, and used Gaussian blurring to create a blurred dataset; we also saved a portion of the images generated during training to construct a historical dataset. These two datasets are then employed as fake data to bolster training, improving performance in anime face generation. Experimental results demonstrate that our model outperforms the DCGAN, WGAN-GP, and SAGAN, generating clearer, higher-quality images and converging more rapidly.
In our future work, we aim to explore additional methods to enhance GAN performance in specific tasks through modifications to the loss function. We will also test our model on a wider range of datasets to enhance its generalization capabilities. Furthermore, we intend to investigate the potential impact of the blurred dataset on the overall image generation style.