1. Introduction
Wireless capsule endoscopy (WCE) is renowned worldwide for its patient-centric approach and non-invasiveness. It provides an innovative, non-invasive method for examining the complete human digestive tract and has been in clinical use for more than 20 years [1].
The wireless capsule endoscope is a small, swallowable capsule that contains a built-in camera and a signal transmission system. It travels through the digestive tract using natural peristalsis, capturing images as it moves until it is eventually excreted from the body. Doctors diagnose the patient’s condition by analyzing the collected images. The wireless capsule endoscope typically operates inside the human body for 5–8 h, while lesion images often account for less than 1% of the total number of images [2]. It takes doctors at least 3 h to sift through the vast and complex array of images to select the key lesion images, and during this time they often miss or misidentify lesion images, leading to low efficiency in manual detection. In view of this, many researchers are committed to developing efficient algorithms to automatically identify abnormalities in WCE images and assist clinicians in diagnosing gastrointestinal diseases more accurately and efficiently [2,3,4]. Compared with manual feature extraction and traditional machine learning methods, deep convolutional neural networks (CNNs) have achieved great success in many visual tasks, such as image reconstruction [5,6], object detection [7,8,9], and image generation [10,11].
Deep learning has become a popular method for detecting anomalies in wireless capsule endoscopy (WCE) images. To train an effective lesion detection model, it is essential to have a large, diverse, and balanced dataset. Unfortunately, the limited availability of abnormal WCE images, along with the sensitivity of the data and the expertise required for accurate annotation, makes it challenging to create a comprehensive lesion dataset. As a result, existing public datasets of wireless capsule endoscopy images often have an imbalanced distribution of lesion types. For instance, in the Kvasir-Capsule dataset, the largest capsule endoscopy dataset [12], healthy images account for 73.93%, while erosion and polyp lesions comprise only 1.09% and 0.12%, respectively. Training on such an imbalanced dataset can yield a lesion detection model that is less sensitive to under-represented lesion classes and prone to overfitting.
Conventional data augmentation techniques, such as flipping, random cropping, and rotation, can alleviate data scarcity to some degree. However, these methods primarily focus on geometric or color transformations and often do not take the semantic content into account. As a result, the augmented images may not sufficiently preserve the critical medical features and structural integrity of the original images [13]. In this paper, we propose a novel generative adversarial network (SCAGAN) for generating wireless capsule endoscopy lesion images. A series of experiments demonstrate that SCAGAN can effectively augment small-sample datasets, addressing the issue of data imbalance in training multiclass capsule endoscopy lesion detection models. The main contributions of this paper are as follows:
We propose a Special Common Attention Net (SCA Net) and, by combining it with an existing generative model architecture, design the SCAGAN model for generating capsule endoscopy lesion images. This method effectively improves the quality and diversity of the generated WCE lesion images.
By incorporating self-modulated regularization [14] and the DiffAug technique [35], the training of both the generator and discriminator in the proposed model is optimized, leading to significant improvements in model stability and generalization.
We propose and design a Structural Similarity Loss function (SSIM Loss) and demonstrate its effectiveness in improving the convergence speed of model training.
Through both qualitative and quantitative analysis of the generated images, we validated the effectiveness of SCAGAN in generating capsule endoscopic lesion images. These images were then used to train the lesion detection model. Experimental comparisons showed that this approach significantly enhanced the accuracy and robustness of the detection model.
2. Related Work
Generative adversarial networks (GANs) have become a key area of deep learning research since their introduction by Goodfellow et al. [16] in 2014, thanks to their ability to generate realistic data distributions. However, the original GAN often encounters issues such as unstable training and mode collapse [17], primarily due to the use of JS divergence as the loss function. To address these challenges, researchers have introduced numerous improvements to the architecture [18] and loss function [19] of GANs.
To enhance the diversity and controllability of generated samples, Mirza and Osindero [20] introduced the Conditional Generative Adversarial Network (CGAN) in 2014. By incorporating conditional information (such as labels or specific inputs) into the training of both the generator and discriminator, CGAN allows the generator to produce target samples based on predefined conditions. To improve the training stability and generative performance of GANs, Radford et al. [21] proposed the Deep Convolutional Generative Adversarial Network (DCGAN), which leverages convolutional layers, batch normalization, and a tailored loss function to generate high-quality images. This approach enables the network to learn and generate images autonomously, significantly enhancing both image quality and training stability. In 2017, Zhu et al. [22] introduced CycleGAN, which enables image translation tasks without the need for paired training data, significantly broadening the practical applications of generative adversarial networks. In 2018, Progressive GAN (ProGAN) [23] addressed training instability by progressively increasing the resolution of both the generator and discriminator, enabling more efficient training of high-resolution images. SAGAN [24] introduced a self-attention mechanism into the generative adversarial network, which enables the generator to consider contextual information from other positions in the image when generating each pixel. This mechanism allows the model to better capture global dependencies and further improves the quality of generated images in terms of detail representation and global consistency. Brock et al. proposed BigGAN [25], which greatly improves the resolution and diversity of generated images by using larger models and richer training datasets. In 2019, NVIDIA proposed StyleGAN [26], which injects latent-space styles into different levels of the generator, making the generated images more realistic and allowing flexible control over every detail of the image (such as facial features, background, etc.). The DALL·E model [27], proposed by OpenAI in 2021 and built on the Transformer architecture, can generate images corresponding to a text description.
Improving the loss function of generative models is another key area of development. For instance, Wasserstein GAN (WGAN) [19] replaces JS divergence with the Wasserstein distance and enforces the required Lipschitz constraint through weight clipping, while WGAN-GP [28] replaces clipping with a gradient penalty, significantly enhancing training stability and generation quality. The Least Squares Generative Adversarial Network (LSGAN) [29] further refines the sharpness of generated images using a least squares loss. These diverse loss function improvements offer valuable insights for the evolution of generative models, enabling their application to more complex tasks and scenarios, such as medical image synthesis and virtual reality scene construction, where they have shown great potential [25,26].
In recent years, advancements in generative adversarial networks (GANs) have increasingly been applied to the field of medical image generation. For example, Dar et al. [30] trained a GAN to synthesize T1-weighted brain MRIs with quality comparable to real images, while Bissoto et al. [31] successfully generated high-resolution skin lesion images that experts could not reliably distinguish from real ones. DR-GAN [32] generates high-resolution fundus images for diabetic retinopathy and can adjust the severity of retinopathy based on predefined grading and lesion information. MedGAN [33] integrates an adversarial framework with non-adversarial losses to perform end-to-end medical image translation. MedSRGAN [34] generates high-resolution images from low-resolution medical scans, such as low-dose CT or low-field MRI images. Despite these advancements, existing methods still struggle to fully capture both the detailed features of lesions and the global context information necessary for generating complex wireless capsule endoscopy lesion images.

To address this challenge, we propose SCAGAN, which significantly improves the generation of capsule endoscopy lesion images by incorporating SCA Net and self-modulated regularization [14]. Additionally, we design a Structural Similarity Index (SSIM) loss function to quantify the structural similarity between generated and real images. The SSIM loss enhances the generator’s ability to capture structural details during training, effectively mitigating the loss of fine details commonly observed with traditional adversarial loss functions. Furthermore, the integration of differentiable data augmentation (DiffAug) [35] boosts the robustness and generalization of both the generator and discriminator. Experimental results demonstrate that SCAGAN outperforms existing methods in generating high-quality, diverse capsule endoscopy images. It also improves structural consistency through the SSIM loss, provides superior training data for lesion detection networks, and significantly enhances detection performance and robustness.
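To make the role of DiffAug concrete, the sketch below applies the same differentiable augmentation to both real and generated images before they reach the discriminator. This is a minimal illustration with a simplified brightness-and-translation policy standing in for the full DiffAug policy set; it is not the exact implementation used in SCAGAN.

```python
import torch
import torch.nn.functional as F

def diff_augment(x: torch.Tensor) -> torch.Tensor:
    """Differentiable brightness jitter plus random translation (a simplified DiffAug policy)."""
    # Brightness: add a random per-sample offset; gradients flow through the addition.
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)
    # Translation: shift the batch by up to 1/8 of the image size, padding with zeros.
    shift = max(1, int(x.size(-1) * 0.125))
    dx, dy = torch.randint(-shift, shift + 1, (2,))
    x = F.pad(x, (shift, shift, shift, shift))
    x = torch.roll(x, shifts=(int(dy), int(dx)), dims=(2, 3))
    return x[:, :, shift:-shift, shift:-shift]

# The same augmentation is applied to real and generated batches before the
# discriminator, so the discriminator never sees an un-augmented image:
#   d_real = discriminator(diff_augment(real_images))
#   d_fake = discriminator(diff_augment(generator(z)))
```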
4. Experiments
To evaluate the proposed method, we conducted extensive experiments using the Kvasir-Capsule dataset [12]. Section 4.1 describes the experimental dataset, Section 4.2 outlines the evaluation metrics used in this study, Section 4.3 provides a qualitative evaluation of the capsule endoscopy lesion images generated by SCAGAN, Section 4.4 evaluates these images through both human visual perception [42] and the Fréchet Inception Distance (FID) [15], comparing them with state-of-the-art models, Section 4.5 examines the impact of the SCAGAN-augmented dataset on the lesion detection model, and Section 4.6 presents an ablation study of the proposed method. All experiments were conducted on an NVIDIA RTX 4090 24 GB GPU with a 16-core CPU. For training, we used the Adam optimizer with beta1 = 0 and beta2 = 0.9. By default, the discriminator’s learning rate was set to 0.0004, while the generator’s learning rate was 0.0001.
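As a concrete illustration of this training configuration, the following is a minimal sketch assuming a PyTorch implementation (the framework is not stated in the text); the `generator` and `discriminator` modules are placeholders for SCAGAN's networks.

```python
import torch

# Placeholder modules standing in for SCAGAN's generator and discriminator.
generator = torch.nn.Sequential(torch.nn.Linear(128, 3 * 64 * 64))
discriminator = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))

# Adam with beta1 = 0 and beta2 = 0.9, as described above; the discriminator uses a
# higher learning rate (4e-4) than the generator (1e-4).
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```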
4.1. Dataset
The Kvasir-Capsule dataset [12] is a detection dataset collected using video capsule endoscopy (VCE) at a hospital in Norway. It consists of 43 labeled videos covering various gastrointestinal (GI) conditions and 74 unlabeled videos. The dataset contains a total of 4,741,504 frames, of which 47,238 frames are annotated across 14 categories. These categories include anatomical landmarks such as the pylorus, ileocecal valve, and Vater’s papilla, as well as normal mucosa, reduced mucosal views, and a range of abnormalities including fresh blood, hematin (old blood), vascular dilation (superficial blood vessels prone to bleeding), erosion, ulcers, erythema, polyps, lymphangiectasia, and foreign bodies. Additionally, there are 4,694,266 unlabeled frames, as shown in Figure 4.
Among the labeled images, there are 34,338 normal and clean mucosal images, accounting for 72.661%, while the Ampulla of Vater and Blood-hematin lesion images account for only 0.021% and 0.025%, respectively, indicating a highly unbalanced class distribution. In this paper, the erosion lesion data in the Kvasir-Capsule dataset were used to explore whether our model can effectively learn and generate high-quality lesion images when the data are relatively sufficient. The dataset is divided into a training set and a test set in an 8:2 ratio.
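For reference, an 8:2 split of the erosion images can be obtained as in the following sketch; the file list and random seed are illustrative assumptions, not details taken from the dataset release.

```python
import random

# Hypothetical list of erosion-lesion image paths from Kvasir-Capsule.
image_paths = [f"erosion/{i:05d}.jpg" for i in range(4000)]

random.seed(42)                      # fixed seed so the split is reproducible
random.shuffle(image_paths)
split = int(0.8 * len(image_paths))  # 8:2 ratio between training and test sets
train_set, test_set = image_paths[:split], image_paths[split:]
```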
4.2. Evaluation Metrics
This paper adopts the HYPE∞ [42] and FID [15] metrics to comprehensively measure the similarity between the original and generated images, thereby guiding the training of the network model in a targeted manner, and it uses the mean average precision (mAP) to evaluate the performance of the lesion detection model.
4.2.1. Human eYe Perceptual Evaluation
Human eye perceptual evaluation (HYPE) [42] standardizes human evaluation of model-generated images by considering either the time required to distinguish between real and fake images (HYPE-time) or the misjudgment rate under unlimited time (HYPE∞). Theoretically, when HYPE∞ reaches 50%, the generated results are almost indistinguishable from the real data.
4.2.2. Fréchet Inception Distance
FID (Fréchet Inception Distance) was first introduced [15] to evaluate the quality of GAN-generated images using an Inception network pre-trained on the ImageNet [43] dataset. The generated samples and real images are fed into the pre-trained Inception network, the means and covariances of the activations in the final block are computed for the two sets, and, assuming Gaussian distributions, the Fréchet distance between them is calculated as shown in Equation (13):

FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2))  (13)

where μ_r and μ_g are the feature means of the real and generated images, respectively, Σ_r and Σ_g are the corresponding covariance matrices, and Tr denotes the trace of a matrix. A lower FID value indicates that the distribution of the generated images is closer to that of the real images, i.e., higher generation quality.
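A minimal NumPy/SciPy sketch of Equation (13) is given below, assuming the Inception features of the real and generated images have already been extracted into two arrays; the feature dimensionality and array names are illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```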
4.2.3. Mean Average Precision
In this paper, we use mAP (mean average precision) as the evaluation metric for the lesion detection model. mAP is a metric used to measure the overall performance of an object detection model, combining both precision and recall. It is the mean of the average precision (AP) values across all categories. AP is the integral (or approximate integral) under the PR curve, representing the average precision of the model at different recall levels. Here, TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. The equations for calculating mAP are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AP_i = Σ_k P(k) · ΔR(k)
mAP = (1 / N) Σ_{i=1}^{N} AP_i

P(k) is the precision at the k-th point, and ΔR(k) is the recall increment between the k-th point and the previous point, i.e., R(k) − R(k − 1). AP_i is the average precision of the i-th category, and N is the number of categories.
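The sketch below reproduces this calculation for pre-computed precision/recall points; the per-class (precision, recall) lists in the toy example are illustrative placeholders, not real results.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP = sum_k P(k) * (R(k) - R(k-1)) over recall-sorted points."""
    p, r = np.asarray(precisions, dtype=float), np.asarray(recalls, dtype=float)
    order = np.argsort(r)
    p, r = p[order], r[order]
    delta_r = np.diff(np.concatenate(([0.0], r)))  # R(k) - R(k-1), with R(0) = 0
    return float(np.sum(p * delta_r))

def mean_average_precision(per_class_pr):
    """mAP = mean of AP_i over all categories."""
    return float(np.mean([average_precision(p, r) for p, r in per_class_pr]))

# Toy example with two categories (placeholder values).
print(mean_average_precision([
    ([1.0, 0.8, 0.6], [0.2, 0.5, 1.0]),
    ([0.9, 0.7],      [0.4, 1.0]),
]))
```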
4.3. Qualitative Evaluation
To demonstrate the effectiveness of our method, we visually compared erosion lesion image samples generated by different models. Using an erosion lesion image as input, we generated corresponding images through SAGAN, ProGAN, and SCAGAN.
Figure 5 presents the real erosion lesion image alongside the results generated by SAGAN, ProGAN, and SCAGAN. The comparison revealed that the erosion lesion images produced by SCAGAN exhibited fine texture and high definition, with a smooth transition between healthy tissue and the lesion area. Moreover, SCAGAN was able to generate lesions with varying types, shapes, sizes, and positions, showing a clear improvement over the other models.
By comparing the generated lesion images with real ones, we observed that the images produced by the SAGAN model exhibit noticeable differences in texture clarity and color transitions. The images generated by SAGAN lack the fine details and layering of the lesions and show limited diversity in lesion types. ProGAN shows improvements in texture generation and color distribution, with generated images whose colors are closer to those of real lesions. However, it still fails to capture subtle details, particularly the complexity and diversity of lesions, demonstrating insufficient generalization ability. In contrast, our SCAGAN not only accurately restores the fine structural features of the lesions, but also replicates the natural color transitions, tissue layering, and edge clarity. The images generated by SCAGAN are highly realistic and exhibit greater detail diversity, making them nearly indistinguishable from real images. Moreover, SCAGAN demonstrates exceptional diversity in generating capsule endoscopy lesion images, covering lesions of various sizes and locations, effectively capturing the complexity of lesions and the diversity of clinical manifestations.
SCAGAN outperforms other models in terms of lesion diversity, detail restoration, and color transition. The quality and variety of the wireless capsule endoscopy lesion images it generates significantly surpass those produced by other models. SCAGAN demonstrates higher practical value in capsule endoscopy lesion image generation, providing more accurate and detailed image data for capsule endoscopy analysis, early disease diagnosis, and personalized treatment.
4.4. Quantitative Evaluation
4.4.1. The Result of Human eYe Perceptual Evaluation
We invited three gastrointestinal medical experts to evaluate the quality of capsule endoscopy lesion images generated by different models. The experiment consisted of two groups: in the first group, experts independently rated both real and generated lesion images, directly comparing their quality; in the second group, experts were tasked with identifying real lesion images from a mixed set of images. To quantitatively analyze the results, we present the mean authenticity scores from the experts’ ratings in Table 1.
Table 2 shows the misclassification rates for each group, including the HYPE∞, real sample misclassification rate (Real Error), and fake sample misclassification rate (Fake Error). A HYPE∞ value close to 50% indicates that the generated images have high realism, making it difficult for experts to distinguish them from real images.
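For clarity, the three rates reported in Table 2 can be computed from the experts' real/fake judgments as in the following sketch; the label arrays in the toy example are hypothetical.

```python
import numpy as np

def hype_scores(is_real: np.ndarray, judged_real: np.ndarray):
    """Misclassification rates from ground-truth labels and expert judgments."""
    wrong = is_real != judged_real
    real_error = wrong[is_real].mean()    # real images judged as fake
    fake_error = wrong[~is_real].mean()   # generated images judged as real
    hype_inf = wrong.mean()               # overall error rate (HYPE-infinity)
    return hype_inf, real_error, fake_error

# Toy example: 4 real and 4 generated images with hypothetical expert judgments.
is_real     = np.array([True, True, True, True, False, False, False, False])
judged_real = np.array([True, False, True, True, True, False, True, False])
print(hype_scores(is_real, judged_real))
```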
In the first experiment, we used SCAGAN to generate 10,000 lesion images and randomly selected 400 of them. We also randomly selected 400 real lesion images for comparison. These images were given to three experts in the digestive tract field for authenticity scoring, and we calculated the average score given by the three experts for each group of images. The real images are also scored because even some real images do not have perfect quality, and their results indicate the upper-bound performance. In addition, we evaluated six other image synthesis methods for comparison. As shown in Figure 6, the average authenticity score of the images synthesized by our SCAGAN reached 8.07, less than 1 point lower than that of the real images, demonstrating that the images generated by our network differ very little in appearance from real images.
In the second experiment, we used the different generative models to generate 10,000 lesion images each, randomly selected 400 images from each model, mixed them with 400 real lesion images, and asked the digestive experts to judge their authenticity. The statistical results are shown in Table 2.
As shown in Table 2, SCAGAN and ViTGAN have HYPE∞ values close to 50%, at 44.25% and 43.75%, respectively, indicating that the images generated by these models are difficult to classify accurately. Although ViTGAN’s HYPE∞ is close to 50%, it has low scores in terms of average authenticity and Fake Error. In contrast, the HYPE∞ values for the other networks are above 70%, with DCGAN having the highest at 77.25%, suggesting that its generated images are most easily identified as fake, highlighting a significant gap between the generated and real images. The Real Error represents the probability of a real image being misclassified as fake; lower values indicate better quality of the real images. Most networks show similar performance in this metric, with values ranging from 9% to 10.5%. The Fake Error represents the probability of a fake image being misclassified as real; higher values indicate higher authenticity of the generated images. SCAGAN achieves the highest Fake Error at 81.5%, meaning its generated images are the most difficult to distinguish from real ones, followed by PGGAN at 59%, SAGAN at 51.5%, and WGAN at 43%. ViTGAN has the lowest Fake Error (21.5%), indicating that its images are more easily identified as fake.
In summary, SCAGAN outperforms all other models in terms of realism, HYPE∞, and Fake Error. The images generated by SCAGAN exhibit a high degree of realism, making them extremely difficult to distinguish from real images, positioning SCAGAN as the most capable model for generating images close to reality.
4.4.2. Assessment of FID
Table 3 shows the FID scores of our proposed SCAGAN model and other generative networks on the dataset. Compared with the other networks, our SCAGAN model achieved the lowest FID score of 31.349, which is about 86.4% lower than the baseline model DCGAN, 83.9% lower than SAGAN, and about 77.6% lower than ProGAN. The average FID values of the models are reported in Table 3.
We also visualized the FID evaluation metric during the models’ training process, as shown in Figure 7.
Figure 7 shows the training curves of the comparison models. As can be seen from Figure 7d, the SCAGAN model we designed showed excellent training stability and generation quality during the training process. Its FID value dropped rapidly in the early stage of training and remained stable at a low level in the later stage, indicating that the adversarial training between its generator and discriminator achieved a good balance. In contrast, the DCGAN and SAGAN models in Figure 7a,b show large fluctuations during training, especially in the later stage, where the FID value increased, indicating that their generation is unstable or that overfitting occurred. As can be seen from Figure 7c, although the ProGAN model trains relatively stably, its final FID value is still higher than that of SCAGAN, indicating that its generation quality is insufficient. Overall, SCAGAN is significantly better than the other models in terms of training convergence speed, stability, and final generation quality, showing excellent generative capability.
4.5. Data Augmentation with SCAGAN
Our primary concern is whether the lesion images generated by the SCAGAN model can effectively address the class imbalance among different lesion categories in the dataset and improve the performance of the lesion detection model. To evaluate this, we trained a detection model for erosion lesions in wireless capsule endoscopic images using the YOLOv9 framework. The training was conducted in two scenarios: one with data augmentation using the generated lesion images and another without the generated data. By comparing the performance of these two training approaches, we assess whether the generated data can enhance the model’s detection performance, especially when dealing with small sample sizes.
To study the influence of the lesion images generated by SCAGAN on the training of the lesion detection model, four control groups were designed in this paper: pub400 (using only 400 real images), pub200 + gen200 (200 real images and 200 generated images), pub400 + gen200 (400 real images and 200 generated images), and pub400 + gen400 (400 real images and 400 generated images). The pub400 benchmark group shows how the model performs when relying only on real images and provides a reference standard for the subsequent experiments. The pub200 + gen200 group combines 200 real erosion lesion images with 200 erosion lesion images generated by SCAGAN to analyze whether the generated images can effectively compensate for the lack of data when data are scarce. The pub400 + gen200 group adds a certain number of generated erosion lesion images on top of a larger number of real images and further explores whether expanding the dataset with generated images improves the performance of the object detection network when data are sufficient. The pub400 + gen400 group was designed to study the effect of increasing the number of generated images on the performance of the lesion detection model.
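The composition of the four groups can be summarized as in the following sketch; the counts follow the description above, while the directory names and pool sizes are assumptions introduced for illustration.

```python
import random

def build_group(real_paths, gen_paths, n_real, n_gen, seed=0):
    """Sample n_real real and n_gen generated images to form one training group."""
    rng = random.Random(seed)
    return rng.sample(real_paths, n_real) + rng.sample(gen_paths, n_gen)

# Hypothetical pools of real and SCAGAN-generated erosion images.
real_pool = [f"real/erosion_{i:04d}.jpg" for i in range(400)]
gen_pool  = [f"gen/erosion_{i:05d}.jpg" for i in range(10000)]

groups = {
    "pub400":          build_group(real_pool, gen_pool, 400, 0),
    "pub200 + gen200": build_group(real_pool, gen_pool, 200, 200),
    "pub400 + gen200": build_group(real_pool, gen_pool, 400, 200),
    "pub400 + gen400": build_group(real_pool, gen_pool, 400, 400),
}
```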
Table 4 shows the comparison of the mAP50 values of the four groups of experiments, and Figure 8 shows the four training processes of the capsule endoscopic lesion detection model.
As shown in Figure 8a, when the confidence level is below 0.78, the recall rate of the pub200 + gen200 group is slightly lower than that of the baseline model pub400, indicating that a high proportion of generated images may introduce some interference when only a small number of real samples is available, slightly reducing the model’s detection ability. As the confidence increases, the recall rates of the pub400 + gen200 and pub400 + gen400 models both exceed the baseline, showing that the lesion images generated by SCAGAN can significantly improve the recall of the model and reduce missed detections.
Figure 8b shows that the detection accuracy of the baseline model pub400 is the highest when the confidence level is lower than 0.65, indicating that the addition of generated images at low confidence levels may lead to false positives. In the case of high confidence, the addition of generated images can significantly improve the accuracy of the model, and pub400 + gen200 is superior to pub400 + gen400 in accuracy, indicating that the balance between the quantity and quality of generated images is more critical than simply increasing the number of generated images. Too many generated images may cause the model to overfit or introduce redundant information, resulting in limited improvement in recall rates.
In object detection, mAP50 (mean average precision at an IoU threshold of 50%) is a commonly used comprehensive performance metric that takes into account both precision and recall. As shown in Figure 9, in all experimental groups, after using the generated data to expand the dataset, the model performance is better than the baseline model pub400, with the pub400 + gen400 group achieving the highest mAP value. This indicates that introducing generated images can significantly improve the mAP performance of the lesion detection model, help the model better capture the characteristics of different types of lesions, and improve the detection of low-frequency lesions.
To prevent overfitting and ensure that the model retains its ability to generalize to new data, we implemented an early stopping mechanism during the training of the capsule endoscopic lesion detection model based on the YOLOv9 framework. Training is halted if the model’s performance does not improve for 10 consecutive epochs. As shown in Figure 9, the pub400 training, which includes only real images, allows the model to focus on learning the features of authentic samples, enabling quicker adaptation and better performance. On the other hand, pub400 + gen200, with a moderate quantity of high-quality generated images, provides additional samples to supplement the real data, enhancing the model’s generalization ability. However, this also results in a smoother loss curve, longer training time, and a delayed early stop, as the model takes longer to reach optimal performance.
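The early-stopping rule described above can be expressed as a small helper; this is a generic sketch (a patience of 10 epochs on the monitored validation metric), not the YOLOv9 implementation itself.

```python
class EarlyStopping:
    """Stop training when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Return True when training should stop (e.g., metric = validation mAP50)."""
        if metric > self.best:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```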
As the amount of available real data increases, the marginal benefit of adding additional generated data will gradually decrease. However, when faced with a lack of real data in extreme cases, the lesion data generated by SCAGAN can effectively expand the dataset and improve model performance. It can be concluded that the capsule endoscopy lesion image data generated by our SCAGAN model can effectively improve the performance of the lesion detection model in a small-scale dataset scenario.
4.6. Ablation
In this section, we conducted a series of experiments to validate the impact of different modules and strategies on the performance of the image generation model. Specifically, we evaluated the effects of the Special Attention module, the Common Attention module, and their combined performance. Additionally, we further explored the contributions of the DiffAug strategy and the designed SSIM loss function to the model’s effectiveness.
Table 5 presents the results of each experimental group.
As shown in Table 5, we first evaluated the performance of the base model DCGAN on the Kvasir-Capsule dataset. The experiments showed that without any additional techniques, the FID value of DCGAN was high, the image quality was poor, and there was a large visual gap between its outputs and real capsule endoscopy lesion images. To improve the quality of the generated images, we first integrated the Special Attention module into DCGAN. The experimental results showed that the FID decreased from 230.519 to 134.187, a reduction of about 41.83%, indicating that the Special Attention module significantly improves the quality and detail representation of local features in the generated images. We then introduced the Common Attention module into DCGAN, aiming to strengthen the model’s attention to the common features of the samples and improve the global structural consistency of the generated images. Although its effect is weaker than that of the Special Attention module, it still reduced the FID by about 17%. To enable the model to attend to both the details within a single sample and the common features across the sample set, we integrated the Special Attention and Common Attention modules into DCGAN simultaneously, which further improved the quality of the generated images; the FID decreased to 88.005, showing a significant performance improvement.

Through ablation experiments on the DiffAug strategy, we found that DiffAug can effectively stabilize GAN training, alleviate the overfitting caused by dataset imbalance, impose a beneficial constraint on the generator, and further improve the diversity and quality of the generated images. Finally, we introduced the SSIM loss function, which markedly improves the structure and detail retention of the generated images. The experimental results show that with the SSIM loss, the generated images are more realistic in terms of visual effect and the FID reaches its lowest value, indicating that the quality and structure of the generated images are finer and more realistic. However, adding the SSIM loss can also introduce some noise. Based on these experiments, we ultimately assigned a weight of 0.1 to the SSIM loss, which gives the generated images better subjective visual quality while preserving overall image quality.
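As an illustration of how the SSIM term enters the objective with this weight, the following is a minimal sketch assuming a PyTorch generator and the third-party `pytorch_msssim` package; the adversarial term shown is a generic non-saturating loss, not necessarily the exact formulation used in SCAGAN.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed available)

def generator_loss(d_fake_logits, fake_images, real_images, ssim_weight=0.1):
    """Adversarial loss plus a weighted structural-similarity term."""
    # Non-saturating adversarial loss: push D(fake) toward "real".
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    # SSIM is a similarity in [0, 1]; (1 - SSIM) acts as a structural dissimilarity loss.
    ssim_loss = 1.0 - ssim(fake_images, real_images, data_range=1.0)
    return adv + ssim_weight * ssim_loss
```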
The ablation analysis of the above series of experiments clearly shows the independent contribution of each module and strategy to the performance of the generative model, as well as the synergistic effect of their combination. The experimental results show that the combination of the Special Attention and Common Attention modules, together with the DiffAug augmentation strategy and the SSIM loss function, jointly drives a significant improvement in the quality of the generated capsule endoscopy lesion images.
5. Conclusions
In this paper, a new attention mechanism, SCA Net, is proposed and successfully integrated into a generative adversarial network, and an image generation model, SCAGAN, is proposed for generating capsule endoscopy lesion images. Compared with traditional generative models, the images generated by SCAGAN achieve a qualitative leap in fidelity and diversity and reach a leading level on the Fréchet Inception Distance (FID) metric. Through comparative experiments on lesion detection models, we further verified the effectiveness of SCAGAN-generated images in augmenting lesion datasets and demonstrated the potential of SCAGAN in practical medical applications. As an innovative generative adversarial network model, SCAGAN has achieved remarkable results in wireless capsule endoscopy lesion image detection, providing new ideas and methods for the development of medical image analysis and disease diagnosis technology. A key limitation of this study is the lack of integration with multimodal information, such as lesion descriptions or contextual clinical data, which could guide and refine the generation process. Incorporating such data may enhance the realism and clinical relevance of the generated images, making them more aligned with real-world medical scenarios. Future research can further expand its application to multimodal medical image fusion and intelligent diagnosis systems, continuing to advance medical imaging technology toward greater accuracy and efficiency.