1. Introduction
Wireless capsule endoscopy (WCE) is renowned worldwide for its patient-centric approach and non-invasiveness. It provides an innovative, non-invasive method for examining the complete human digestive tract and has been in clinical use for more than 20 years [1].
The wireless capsule endoscope is a small, swallowable capsule that contains a built-in camera and a signal transmission system. It travels through the digestive tract using natural peristalsis, capturing images as it moves until it is eventually excreted from the body. Doctors diagnose the patient’s condition by analyzing the collected images. The wireless capsule endoscope typically operates inside the human body for 5–8 h, while lesion images often account for less than 1% of the total number of images [2]. It takes doctors at least 3 h to sift through the vast and complex array of images to select the key lesion images, and during this time they often miss or misidentify lesion images, leading to low efficiency in manual detection. In view of this, many researchers are committed to developing efficient algorithms to automatically identify abnormalities in WCE images and assist clinicians in diagnosing gastrointestinal diseases more accurately and efficiently [2,3,4]. Compared with manual feature extraction and traditional machine learning methods, deep convolutional neural networks (CNNs) have achieved great success in many visual tasks, such as image reconstruction [5,6], object detection [7,8,9], and image generation [10,11].
Deep learning has become a popular method for detecting anomalies in wireless capsule endoscopy (WCE) images. To train an effective lesion detection model, it is essential to have a large, diverse, and balanced dataset. Unfortunately, the limited availability of abnormal WCE images, along with the sensitivity of the data and the expertise required for accurate annotation, makes it challenging to create a comprehensive lesion dataset. As a result, existing public datasets of wireless capsule endoscopy images often have an imbalanced distribution of lesion types. For instance, in the Kvasir-Capsule dataset, the largest capsule endoscopy dataset [12], healthy images account for 73.93%, while erosion and polyp lesions comprise only 1.09% and 0.12%, respectively. Training on such an imbalanced dataset can yield a lesion detection model that is less sensitive to under-represented lesion classes and prone to overfitting.
Conventional data augmentation techniques, such as flipping, random cropping, and rotation, can alleviate data scarcity to some degree. However, these methods primarily focus on geometric or color transformations and often do not take the semantic content into account. As a result, the augmented images may not sufficiently preserve the critical medical features and structural integrity of the original images [13]. In this paper, we propose a novel generative adversarial network (SCAGAN) for generating wireless capsule endoscopy lesion images. A series of experiments demonstrate that SCAGAN can effectively augment small-sample datasets, addressing the issue of data imbalance in training multiclass capsule endoscopy lesion detection models. The main contributions of this paper are as follows:
We propose a Special Common Attention Net (SCA Net) and, by combining it with an existing generative model architecture, design the SCAGAN model for generating capsule endoscopy lesion images. This method effectively improves the quality and diversity of the generated WCE lesion images.
By incorporating self-modulated regularization [14] and the DiffAug technique [35], the training of both the generator and discriminator in the proposed model is optimized, leading to significant improvements in model stability and generalization.
We propose and design a Structural Similarity Loss function (SSIM Loss) and demonstrate its effectiveness in improving the convergence speed of model training.
Through both qualitative and quantitative analysis of the generated images, we validated the effectiveness of SCAGAN in generating capsule endoscopic lesion images. These images were then used to train the lesion detection model. Experimental comparisons showed that this approach significantly enhanced the accuracy and robustness of the detection model.
2. Related Work
Generative adversarial networks (GANs) have become a key area of deep learning research since their introduction by Goodfellow et al. [16] in 2014, thanks to their ability to generate realistic data distributions. However, the original GAN often encounters issues such as unstable training and mode collapse [17], primarily due to the use of JS divergence as the loss function. To address these challenges, researchers have introduced numerous improvements to the architecture [18] and loss function [19] of GANs.
To enhance the diversity and controllability of generated samples, Mirza and Osindero [20] introduced the Conditional Generative Adversarial Network (CGAN) in 2014. By incorporating conditional information (such as labels or specific inputs) into the training of both the generator and discriminator, CGAN allows the generator to produce target samples based on predefined conditions. To improve the training stability and generative performance of GANs, Radford et al. [21] proposed the Deep Convolutional Generative Adversarial Network (DCGAN), which leverages convolutional layers, batch normalization, and a tailored loss function to generate high-quality images. This approach enables the network to learn and generate images autonomously, significantly enhancing both image quality and training stability. In 2017, Zhu et al. [22] introduced CycleGAN, which enables image translation tasks without the need for paired training data, significantly broadening the practical applications of generative adversarial networks. In 2018, Progressive GAN (ProGAN) [23] addressed training instability by progressively increasing the resolution of both the generator and discriminator, enabling more efficient training of high-resolution images. SAGAN [24] introduced a self-attention mechanism into the generative adversarial network, which enables the generator to consider contextual information from other positions in the image when generating each pixel. This mechanism allows the model to better capture global dependencies and further improves the quality of generated images in terms of detail representation and global consistency. Brock et al. proposed BigGAN [25], which greatly improves the resolution and diversity of generated images by using larger models and richer training datasets. In 2019, NVIDIA proposed StyleGAN [26], which injects latent-space styles into different levels of the generator, making the generated images more realistic and allowing flexible control over every detail of the image (such as facial features, background, etc.). The DALL·E model [27], proposed by OpenAI in 2021 and built on the Transformer architecture, can generate images corresponding to a text description.
Improving the loss function of generative models is another key area of development. For instance, Wasserstein GAN (WGAN) [19] replaces JS divergence with the Wasserstein distance and enforces the required Lipschitz constraint through weight clipping, while WGAN-GP [28] replaces clipping with a gradient penalty, significantly enhancing training stability and generation quality. The Least Squares Generative Adversarial Network (LSGAN) [29] further refines the sharpness of generated images using a least squares loss. These diverse loss function improvements offer valuable insights for the evolution of generative models, enabling their application to more complex tasks and scenarios, such as medical image synthesis and virtual reality scene construction, where they have shown great potential [25,26].
In recent years, advancements in generative adversarial networks (GANs) have increasingly been applied to the field of medical image generation. For example, Dar et al. [30] trained a GAN to synthesize T1-weighted brain MRIs with quality comparable to real images, while Bissoto et al. [31] successfully generated high-resolution skin lesion images that experts could not reliably distinguish from real ones. DR-GAN [32] generates high-resolution fundus images for diabetic retinopathy and can adjust the severity of retinopathy based on predefined grading and lesion information. MedGAN [33] integrates an adversarial framework with non-adversarial losses to perform end-to-end medical image translation. MedSRGAN [34] generates high-resolution images from low-resolution medical scans, such as low-dose CT or low-field MRI images. Despite these advancements, existing methods still struggle to fully capture both the detailed features of lesions and the global context information necessary for generating complex wireless capsule endoscopy lesion images.

To address this challenge, we propose SCAGAN, which significantly improves the generation of capsule endoscopy lesion images by incorporating SCA Net and self-modulated regularization [14]. Additionally, we design a Structural Similarity Index (SSIM) loss function to quantify the structural similarity between generated and real images. The SSIM loss enhances the generator’s ability to capture structural details during training, effectively mitigating the loss of fine details commonly observed with traditional adversarial loss functions. Furthermore, the integration of differentiable data augmentation (DiffAug) [35] boosts the robustness and generalization of both the generator and discriminator. Experimental results demonstrate that SCAGAN outperforms existing methods in generating high-quality, diverse capsule endoscopy images. It also improves structural consistency through the SSIM loss, provides superior training data for lesion detection networks, and significantly enhances detection performance and robustness.
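To make the role of DiffAug concrete, the sketch below applies the same differentiable augmentation to both real and generated images before they reach the discriminator. This is a minimal illustration with a simplified brightness-and-translation policy standing in for the full DiffAug policy set; it is not the exact implementation used in SCAGAN.

```python
import torch
import torch.nn.functional as F

def diff_augment(x: torch.Tensor) -> torch.Tensor:
    """Differentiable brightness jitter plus random translation (a simplified DiffAug policy)."""
    # Brightness: add a random per-sample offset; gradients flow through the addition.
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)
    # Translation: shift the batch by up to 1/8 of the image size, padding with zeros.
    shift = max(1, int(x.size(-1) * 0.125))
    dx, dy = torch.randint(-shift, shift + 1, (2,))
    x = F.pad(x, (shift, shift, shift, shift))
    x = torch.roll(x, shifts=(int(dy), int(dx)), dims=(2, 3))
    return x[:, :, shift:-shift, shift:-shift]

# The same augmentation is applied to real and generated batches before the
# discriminator, so the discriminator never sees an un-augmented image:
#   d_real = discriminator(diff_augment(real_images))
#   d_fake = discriminator(diff_augment(generator(z)))
```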
4. Experiments
To evaluate the proposed method, we conducted extensive experiments using the Kvasir-Capsule dataset [12]. Section 4.1 describes the experimental dataset, Section 4.2 outlines the evaluation metrics used in this study, Section 4.3 provides a qualitative evaluation of the capsule endoscopy lesion images generated by SCAGAN, Section 4.4 evaluates these images through both human visual perception [42] and the Fréchet Inception Distance (FID) [15], comparing them with state-of-the-art models, Section 4.5 examines the impact of the SCAGAN-augmented dataset on the lesion detection model, and Section 4.6 presents an ablation study of the proposed method. All experiments were conducted on an NVIDIA RTX 4090 24 GB GPU with a 16-core CPU. For training, we used the Adam optimizer with beta1 = 0 and beta2 = 0.9. By default, the discriminator’s learning rate was set to 0.0004, while the generator’s learning rate was 0.0001.
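As a concrete illustration of this training configuration, the following is a minimal sketch assuming a PyTorch implementation (the framework is not stated in the text); the `generator` and `discriminator` modules are placeholders for SCAGAN's networks.

```python
import torch

# Placeholder modules standing in for SCAGAN's generator and discriminator.
generator = torch.nn.Sequential(torch.nn.Linear(128, 3 * 64 * 64))
discriminator = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))

# Adam with beta1 = 0 and beta2 = 0.9, as described above; the discriminator uses a
# higher learning rate (4e-4) than the generator (1e-4).
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```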
4.1. Dataset
The Kvasir-Capsule dataset [12] is a detection dataset collected using video capsule endoscopy (VCE) at a hospital in Norway. It consists of 43 labeled videos covering various gastrointestinal (GI) conditions and 74 unlabeled videos. The dataset contains a total of 4,741,504 frames, of which 47,238 frames are annotated across 14 categories. These categories include anatomical landmarks such as the pylorus, ileocecal valve, and Vater’s papilla, as well as normal mucosa, reduced mucosal views, and a range of abnormalities including fresh blood, hematin (old blood), vascular dilation (superficial blood vessels prone to bleeding), erosion, ulcers, erythema, polyps, lymphangiectasia, and foreign bodies. Additionally, there are 4,694,266 unlabeled frames, as shown in Figure 4.
Among the labeled images, there are 34,338 normal and clean mucosal images, accounting for 72.661%, while the Ampulla of Vater and Blood-hematin lesion images account for only 0.021% and 0.025%, respectively, indicating a highly unbalanced class distribution. In this paper, the erosion lesion data in the Kvasir-Capsule dataset were used to explore whether our model can effectively learn and generate high-quality lesion images when the data are relatively sufficient. The dataset is divided into a training set and a test set in an 8:2 ratio.
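For reference, an 8:2 split of the erosion images can be obtained as in the following sketch; the file list and random seed are illustrative assumptions, not details taken from the dataset release.

```python
import random

# Hypothetical list of erosion-lesion image paths from Kvasir-Capsule.
image_paths = [f"erosion/{i:05d}.jpg" for i in range(4000)]

random.seed(42)                      # fixed seed so the split is reproducible
random.shuffle(image_paths)
split = int(0.8 * len(image_paths))  # 8:2 ratio between training and test sets
train_set, test_set = image_paths[:split], image_paths[split:]
```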
4.2. Evaluation Metrics
This paper adopts the HYPE∞ [42] and FID [15] metrics to comprehensively measure the similarity between the original and generated images, thereby guiding the training of the network model in a targeted manner, and it uses the mean average precision (mAP) to evaluate the performance of the lesion detection model.
4.2.1. Human eYe Perceptual Evaluation
Human eye perceptual evaluation (HYPE) [42] standardizes human evaluation of model-generated images by considering either the time required to distinguish between real and fake images (HYPE-time) or the misjudgment rate under unlimited time (HYPE∞). Theoretically, when HYPE∞ reaches 50%, the generated results are almost indistinguishable from the real data.
4.2.2. Fréchet Inception Distance
FID (Fréchet Inception Distance) was first introduced [15] to evaluate the quality of GAN-generated images using an Inception network pre-trained on the ImageNet [43] dataset. The generated samples and real images are fed into the pre-trained Inception network, the means and covariances of the activations in the final block are computed for the two sets, and, assuming Gaussian distributions, the Fréchet distance between them is calculated as shown in Equation (13):

FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2))  (13)

where μ_r and μ_g are the feature means of the real and generated images, respectively, Σ_r and Σ_g are the corresponding covariance matrices, and Tr denotes the trace of a matrix. A lower FID value indicates that the distribution of the generated images is closer to that of the real images, i.e., higher generation quality.
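A minimal NumPy/SciPy sketch of Equation (13) is given below, assuming the Inception features of the real and generated images have already been extracted into two arrays; the feature dimensionality and array names are illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```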
4.2.3. Mean Average Precision
In this paper, we use mAP (mean average precision) as the evaluation metric for the lesion detection model. mAP is a metric used to measure the overall performance of an object detection model, combining both precision and recall. It is the mean of the average precision (AP) values across all categories. AP is the integral (or approximate integral) under the PR curve, representing the average precision of the model at different recall levels. Here, TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. The equations for calculating mAP are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AP_i = Σ_k P(k) · ΔR(k)
mAP = (1 / N) Σ_{i=1}^{N} AP_i

P(k) is the precision at the k-th point, and ΔR(k) is the recall increment between the k-th point and the previous point, i.e., R(k) − R(k − 1). AP_i is the average precision of the i-th category, and N is the number of categories.
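The sketch below reproduces this calculation for pre-computed precision/recall points; the per-class (precision, recall) lists in the toy example are illustrative placeholders, not real results.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP = sum_k P(k) * (R(k) - R(k-1)) over recall-sorted points."""
    p, r = np.asarray(precisions, dtype=float), np.asarray(recalls, dtype=float)
    order = np.argsort(r)
    p, r = p[order], r[order]
    delta_r = np.diff(np.concatenate(([0.0], r)))  # R(k) - R(k-1), with R(0) = 0
    return float(np.sum(p * delta_r))

def mean_average_precision(per_class_pr):
    """mAP = mean of AP_i over all categories."""
    return float(np.mean([average_precision(p, r) for p, r in per_class_pr]))

# Toy example with two categories (placeholder values).
print(mean_average_precision([
    ([1.0, 0.8, 0.6], [0.2, 0.5, 1.0]),
    ([0.9, 0.7],      [0.4, 1.0]),
]))
```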
4.3. Qualitative Evaluation
To demonstrate the effectiveness of our method, we visually compared erosion lesion image samples generated by different models. Using an erosion lesion image as input, we generated corresponding images through SAGAN, ProGAN, and SCAGAN.
Figure 5 presents the real erosion lesion image alongside the results generated by SAGAN, ProGAN, and SCAGAN. The comparison revealed that the erosion lesion images produced by SCAGAN exhibited fine texture and high definition, with a smooth transition between healthy tissue and the lesion area. Moreover, SCAGAN was able to generate lesions with varying types, shapes, sizes, and positions, showing a clear improvement over the other models.
By comparing the generated lesion images with real ones, we observed that the images produced by the SAGAN model exhibit noticeable differences in texture clarity and color transitions. The images generated by SAGAN lack the fine details and layering of the lesions and show limited diversity in lesion types. ProGAN shows improvements in texture generation and color distribution, with generated images whose colors are closer to those of real lesions. However, it still fails to capture subtle details, particularly the complexity and diversity of lesions, demonstrating insufficient generalization ability. In contrast, our SCAGAN not only accurately restores the fine structural features of the lesions, but also replicates the natural color transitions, tissue layering, and edge clarity. The images generated by SCAGAN are highly realistic and exhibit greater detail diversity, making them nearly indistinguishable from real images. Moreover, SCAGAN demonstrates exceptional diversity in generating capsule endoscopy lesion images, covering lesions of various sizes and locations, effectively capturing the complexity of lesions and the diversity of clinical manifestations.
SCAGAN outperforms other models in terms of lesion diversity, detail restoration, and color transition. The quality and variety of the wireless capsule endoscopy lesion images it generates significantly surpass those produced by other models. SCAGAN demonstrates higher practical value in capsule endoscopy lesion image generation, providing more accurate and detailed image data for capsule endoscopy analysis, early disease diagnosis, and personalized treatment.
4.4. Quantitative Evaluation
4.4.1. The Result of Human eYe Perceptual Evaluation
We invited three gastrointestinal medical experts to evaluate the quality of capsule endoscopy lesion images generated by different models. The experiment consisted of two groups: in the first group, experts independently rated both real and generated lesion images, directly comparing their quality; in the second group, experts were tasked with identifying real lesion images from a mixed set of images. To quantitatively analyze the results, we present the mean authenticity scores from the experts’ ratings in Table 1.
Table 2 shows the misclassification rates for each group, including the HYPE∞, real sample misclassification rate (Real Error), and fake sample misclassification rate (Fake Error). A HYPE∞ value close to 50% indicates that the generated images have high realism, making it difficult for experts to distinguish them from real images.
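For clarity, the three rates reported in Table 2 can be computed from the experts' real/fake judgments as in the following sketch; the label arrays in the toy example are hypothetical.

```python
import numpy as np

def hype_scores(is_real: np.ndarray, judged_real: np.ndarray):
    """Misclassification rates from ground-truth labels and expert judgments."""
    wrong = is_real != judged_real
    real_error = wrong[is_real].mean()    # real images judged as fake
    fake_error = wrong[~is_real].mean()   # generated images judged as real
    hype_inf = wrong.mean()               # overall error rate (HYPE-infinity)
    return hype_inf, real_error, fake_error

# Toy example: 4 real and 4 generated images with hypothetical expert judgments.
is_real     = np.array([True, True, True, True, False, False, False, False])
judged_real = np.array([True, False, True, True, True, False, True, False])
print(hype_scores(is_real, judged_real))
```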
In the first experiment, we used SCAGAN to generate 10,000 lesion images and randomly selected 400 of them. We also randomly selected 400 real lesion images for comparison. These images were given to three experts in the digestive tract field for authenticity scoring, and we calculated the average score given by the three experts for each group of images. The real images are also scored because even some real images do not have perfect quality, and their results indicate the upper-bound performance. In addition, we evaluated six other image synthesis methods for comparison. As shown in Figure 6, the average authenticity score of the images synthesized by our SCAGAN reached 8.07, less than 1 point lower than that of the real images, demonstrating that the images generated by our network differ very little in appearance from real images.
In the second experiment, we used the different generative models to generate 10,000 lesion images each, randomly selected 400 images from each model, mixed them with 400 real lesion images, and asked the digestive experts to judge their authenticity. The statistical results are shown in Table 2.
As shown in Table 2, SCAGAN and ViTGAN have HYPE∞ values close to 50%, at 44.25% and 43.75%, respectively, indicating that the images generated by these models are difficult to classify accurately. Although ViTGAN’s HYPE∞ is close to 50%, it has low scores in terms of average authenticity and Fake Error. In contrast, the HYPE∞ values for the other networks are above 70%, with DCGAN having the highest at 77.25%, suggesting that its generated images are most easily identified as fake, highlighting a significant gap between the generated and real images. The Real Error represents the probability of a real image being misclassified as fake; lower values indicate better quality of the real images. Most networks show similar performance in this metric, with values ranging from 9% to 10.5%. The Fake Error represents the probability of a fake image being misclassified as real; higher values indicate higher authenticity of the generated images. SCAGAN achieves the highest Fake Error at 81.5%, meaning its generated images are the most difficult to distinguish from real ones, followed by PGGAN at 59%, SAGAN at 51.5%, and WGAN at 43%. ViTGAN has the lowest Fake Error (21.5%), indicating that its images are more easily identified as fake.
In summary, SCAGAN outperforms all other models in terms of realism, HYPE∞, and Fake Error. The images generated by SCAGAN exhibit a high degree of realism, making them extremely difficult to distinguish from real images, positioning SCAGAN as the most capable model for generating images close to reality.
4.4.2. Assessment of FID
Table 3 shows the FID scores of our proposed SCAGAN model and other generative networks on the dataset. Compared with the other networks, our SCAGAN model achieved the lowest FID score of 31.349, which is about 86.4% lower than the baseline model DCGAN, 83.9% lower than SAGAN, and about 77.6% lower than ProGAN. The average FID values of the models are reported in Table 3.
We also visualized the FID evaluation metric during the models’ training process, as shown in Figure 7.
Figure 7 shows the training curves of the comparison models. As can be seen from Figure 7d, the SCAGAN model we designed showed excellent training stability and generation quality during the training process. Its FID value dropped rapidly in the early stage of training and remained stable at a low level in the later stage, indicating that the adversarial training between its generator and discriminator achieved a good balance. In contrast, the DCGAN and SAGAN models in Figure 7a,b show large fluctuations during training, especially in the later stage, where the FID value increased, indicating that their generation is unstable or that overfitting occurred. As can be seen from Figure 7c, although the ProGAN model trains relatively stably, its final FID value is still higher than that of SCAGAN, indicating that its generation quality is insufficient. Overall, SCAGAN is significantly better than the other models in terms of training convergence speed, stability, and final generation quality, showing excellent generative capability.
4.5. Data Augmentation with SCAGAN
Our primary concern is whether the lesion images generated by the SCAGAN model can effectively address the class imbalance among different lesion categories in the dataset and improve the performance of the lesion detection model. To evaluate this, we trained a detection model for erosion lesions in wireless capsule endoscopic images using the YOLOv9 framework. The training was conducted in two scenarios: one with data augmentation using the generated lesion images and another without the generated data. By comparing the performance of these two training approaches, we assess whether the generated data can enhance the model’s detection performance, especially when dealing with small sample sizes.
To study the influence of the lesion images generated by SCAGAN on the training of the lesion detection model, four control groups were designed in this paper: pub400 (using only 400 real images), pub200 + gen200 (200 real images and 200 generated images), pub400 + gen200 (400 real images and 200 generated images), and pub400 + gen400 (400 real images and 400 generated images). The pub400 benchmark group shows how the model performs when relying only on real images and provides a reference standard for the subsequent experiments. The pub200 + gen200 group combines 200 real erosion lesion images with 200 erosion lesion images generated by SCAGAN to analyze whether the generated images can effectively compensate for the lack of data when data are scarce. The pub400 + gen200 group adds a certain number of generated erosion lesion images on top of a larger number of real images and further explores whether expanding the dataset with generated images improves the performance of the object detection network when data are sufficient. The pub400 + gen400 group was designed to study the effect of increasing the number of generated images on the performance of the lesion detection model.
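The composition of the four groups can be summarized as in the following sketch; the counts follow the description above, while the directory names and pool sizes are assumptions introduced for illustration.

```python
import random

def build_group(real_paths, gen_paths, n_real, n_gen, seed=0):
    """Sample n_real real and n_gen generated images to form one training group."""
    rng = random.Random(seed)
    return rng.sample(real_paths, n_real) + rng.sample(gen_paths, n_gen)

# Hypothetical pools of real and SCAGAN-generated erosion images.
real_pool = [f"real/erosion_{i:04d}.jpg" for i in range(400)]
gen_pool  = [f"gen/erosion_{i:05d}.jpg" for i in range(10000)]

groups = {
    "pub400":          build_group(real_pool, gen_pool, 400, 0),
    "pub200 + gen200": build_group(real_pool, gen_pool, 200, 200),
    "pub400 + gen200": build_group(real_pool, gen_pool, 400, 200),
    "pub400 + gen400": build_group(real_pool, gen_pool, 400, 400),
}
```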
Table 4 shows the comparison of the mAP50 values of the four groups of experiments, and Figure 8 shows the four training processes of the capsule endoscopic lesion detection model.
As shown in Figure 8a, when the confidence level is below 0.78, the recall rate of the pub200 + gen200 group is slightly lower than that of the baseline model pub400, indicating that a high proportion of generated images may introduce some interference when only a small number of real samples is available, slightly reducing the model’s detection ability. As the confidence increases, the recall rates of the pub400 + gen200 and pub400 + gen400 models both exceed the baseline, showing that the lesion images generated by SCAGAN can significantly improve the recall of the model and reduce missed detections.
Figure 8b shows that the detection accuracy of the baseline model pub400 is the highest when the confidence level is lower than 0.65, indicating that the addition of generated images at low confidence levels may lead to false positives. In the case of high confidence, the addition of generated images can significantly improve the accuracy of the model, and pub400 + gen200 is superior to pub400 + gen400 in accuracy, indicating that the balance between the quantity and quality of generated images is more critical than simply increasing the number of generated images. Too many generated images may cause the model to overfit or introduce redundant information, resulting in limited improvement in recall rates.
In object detection, mAP50 (mean average precision at an IoU threshold of 50%) is a commonly used comprehensive performance metric that takes into account both precision and recall. As shown in Figure 9, in all experimental groups, after using the generated data to expand the dataset, the model performance is better than the baseline model pub400, with the pub400 + gen400 group achieving the highest mAP value. This indicates that introducing generated images can significantly improve the mAP performance of the lesion detection model, help the model better capture the characteristics of different types of lesions, and improve the detection of low-frequency lesions.
To prevent overfitting and ensure that the model retains its ability to generalize to new data, we implemented an early stopping mechanism during the training of the capsule endoscopic lesion detection model based on the YOLOv9 framework. Training is halted if the model’s performance does not improve for 10 consecutive epochs. As shown in Figure 9, the pub400 training, which includes only real images, allows the model to focus on learning the features of authentic samples, enabling quicker adaptation and better performance. On the other hand, pub400 + gen200, with a moderate quantity of high-quality generated images, provides additional samples to supplement the real data, enhancing the model’s generalization ability. However, this also results in a smoother loss curve, longer training time, and a delayed early stop, as the model takes longer to reach optimal performance.
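The early-stopping rule described above can be expressed as a small helper; this is a generic sketch (a patience of 10 epochs on the monitored validation metric), not the YOLOv9 implementation itself.

```python
class EarlyStopping:
    """Stop training when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Return True when training should stop (e.g., metric = validation mAP50)."""
        if metric > self.best:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```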
As the amount of available real data increases, the marginal benefit of adding additional generated data will gradually decrease. However, when faced with a lack of real data in extreme cases, the lesion data generated by SCAGAN can effectively expand the dataset and improve model performance. It can be concluded that the capsule endoscopy lesion image data generated by our SCAGAN model can effectively improve the performance of the lesion detection model in a small-scale dataset scenario.
4.6. Ablation
In this section, we conducted a series of experiments to validate the impact of different modules and strategies on the performance of the image generation model. Specifically, we evaluated the effects of the Special Attention module, the Common Attention module, and their combined performance. Additionally, we further explored the contributions of the DiffAug strategy and the designed SSIM loss function to the model’s effectiveness.
Table 5 presents the results of each experimental group.
As shown in Table 5, we first evaluated the performance of the base model DCGAN on the Kvasir-Capsule dataset. The experiments showed that without any additional techniques, the FID value of DCGAN was high, the image quality was poor, and there was a large visual gap between its outputs and real capsule endoscopy lesion images. To improve the quality of the generated images, we first integrated the Special Attention module into DCGAN. The experimental results showed that the FID decreased from 230.519 to 134.187, a reduction of about 41.83%, indicating that the Special Attention module significantly improves the quality and detail representation of local features in the generated images. We then introduced the Common Attention module into DCGAN, aiming to strengthen the model’s attention to the common features of the samples and improve the global structural consistency of the generated images. Although its effect is weaker than that of the Special Attention module, it still reduced the FID by about 17%. To enable the model to attend to both the details within a single sample and the common features across the sample set, we integrated the Special Attention and Common Attention modules into DCGAN simultaneously, which further improved the quality of the generated images; the FID decreased to 88.005, showing a significant performance improvement.

Through ablation experiments on the DiffAug strategy, we found that DiffAug can effectively stabilize GAN training, alleviate the overfitting caused by dataset imbalance, impose a beneficial constraint on the generator, and further improve the diversity and quality of the generated images. Finally, we introduced the SSIM loss function, which markedly improves the structure and detail retention of the generated images. The experimental results show that with the SSIM loss, the generated images are more realistic in terms of visual effect and the FID reaches its lowest value, indicating that the quality and structure of the generated images are finer and more realistic. However, adding the SSIM loss can also introduce some noise. Based on these experiments, we ultimately assigned a weight of 0.1 to the SSIM loss, which gives the generated images better subjective visual quality while preserving overall image quality.
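As an illustration of how the SSIM term enters the objective with this weight, the following is a minimal sketch assuming a PyTorch generator and the third-party `pytorch_msssim` package; the adversarial term shown is a generic non-saturating loss, not necessarily the exact formulation used in SCAGAN.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed available)

def generator_loss(d_fake_logits, fake_images, real_images, ssim_weight=0.1):
    """Adversarial loss plus a weighted structural-similarity term."""
    # Non-saturating adversarial loss: push D(fake) toward "real".
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    # SSIM is a similarity in [0, 1]; (1 - SSIM) acts as a structural dissimilarity loss.
    ssim_loss = 1.0 - ssim(fake_images, real_images, data_range=1.0)
    return adv + ssim_weight * ssim_loss
```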
The ablation analysis of the above series of experiments clearly shows the independent contribution of each module and strategy to the performance of the generative model, as well as the synergistic effect of their combination. The experimental results show that the combination of the Special Attention and Common Attention modules, together with the DiffAug augmentation strategy and the SSIM loss function, jointly drives a significant improvement in the quality of the generated capsule endoscopy lesion images.
5. Conclusions
In this paper, a new attention mechanism, SCA Net, is proposed and successfully integrated into a generative adversarial network, and an image generation model, SCAGAN, is proposed for generating capsule endoscopy lesion images. Compared with traditional generative models, the images generated by SCAGAN achieve a qualitative leap in fidelity and diversity and reach a leading level on the Fréchet Inception Distance (FID) metric. Through comparative experiments on lesion detection models, we further verified the effectiveness of SCAGAN-generated images in augmenting lesion datasets and demonstrated the potential of SCAGAN in practical medical applications. As an innovative generative adversarial network model, SCAGAN has achieved remarkable results in wireless capsule endoscopy lesion image detection, providing new ideas and methods for the development of medical image analysis and disease diagnosis technology. A key limitation of this study is the lack of integration with multimodal information, such as lesion descriptions or contextual clinical data, which could guide and refine the generation process. Incorporating such data may enhance the realism and clinical relevance of the generated images, making them more aligned with real-world medical scenarios. Future research can further expand its application to multimodal medical image fusion and intelligent diagnosis systems, continuing to advance medical imaging technology toward greater accuracy and efficiency.