1. Introduction
Lung cancer has one of the highest mortality rates among cancers and is the leading cause of cancer-related deaths in men [
1]. In the diagnosis of lung cancer, if radiological examinations such as chest X-rays or CT scans raise suspicion of lung cancer, a biopsy is performed to collect tissue or cells for pathological examination. In pathological examination, a histopathological diagnosis is made by observing the structure of tissues, and a cytological diagnosis is made by observing the morphology of cells. A final diagnosis is determined by integrating these results.
In cytological diagnosis, cytotechnologists screen cytology specimens, identify abnormal cells, and communicate the marked cells and the image findings (reports) related to the cytological images to the cytopathologist. The cytopathologist then closely examines these cells and, taking into account other test results, such as the histopathological diagnosis, makes the final diagnosis.
In these diagnoses, it is necessary to examine a large number of cells and determine malignancy or tissue type, requiring efficient diagnosis within a limited time. To support this, numerous image recognition technologies, including AI-based methods, have been developed [
2,
3,
4]. For example, Zhang et al. developed a method using a convolutional neural network (CNN) to classify benign and malignant cells in cervical cytology, achieving a classification accuracy of 98.3% [
2]. Furthermore, we proposed a method for distinguishing three lung cancer types in lung cytology using an original CNN model, obtaining an accuracy of 71% [
3]. Additionally, Kaneko et al. developed a classification method for urine cytology images using EfficientNet, which achieved a 95% classification rate [
4].
When developing these classification methods, sufficient performance cannot be achieved if the types and characteristics of the cells in the training data are biased. Therefore, it is crucial to prepare a large and diverse set of images. However, collecting enough data to achieve satisfactory performance is a challenging task.
If diverse data could be artificially generated, this challenge might be overcome. We have previously used generative adversarial networks (GANs) as an artificial image generation technique to produce cytological images and applied them to image classification tasks, generating large numbers of pseudo-benign and pseudo-malignant cytological images to pre-train image classification models and improve classification accuracy [
5]. In another of our previous studies, we utilized GANs to generate training data specifically for the classification of idiopathic interstitial pneumonias (IIPs), a rare disease, successfully enhancing the performance of the classification model through this approach [
6]. However, in our prior research using GANs, images were generated from random noise, making it difficult to obtain images with desired characteristics; the generated images depended on the characteristics and distribution of the original images and were essentially produced at random, so their characteristics could not be controlled.
In recent years, text-to-image technology, which generates images from textual instructions, has been developed. For example, Rombach et al. developed a method that provides textual information to a diffusion model [
7,
8], generating high-resolution images [
9]. This technology, known as Stable Diffusion, is widely used for creating photographs and illustrations. Similarly, Imagen, developed by Saharia et al. [
10], and DALL-E, developed by Ramesh et al. [
11], are also text-to-image models that utilize diffusion models.
In contrast to GANs, text-to-image technology can output images that match the specified content, helping to prevent biases in the characteristics of the generated images. These text-to-image technologies are also being explored for applications in medical image processing. For example, Kaleta et al. generated laparoscopic surgery images from text and applied them to segmentation tasks for objects within the images, achieving good performance [
12]. In the context of chest imaging, Chambon et al. generated X-ray images from text and applied them to detect abnormalities [
13].
However, to the best of our knowledge, there have been no reports on text-to-image research related to cytology. Therefore, in this study, we develop a method to generate lung cytological images from descriptive reports of imaging findings using text-to-image technology and demonstrate its effectiveness through objective evaluation, visual evaluation, and application to image classification tasks. The main contributions of this study are as follows:
A New Approach for Cytology: This is the first attempt to generate cytological images from descriptive reports of cytological findings, contributing to the development of AI for cytology and assisting cytotechnologists and cytopathologists in their diagnoses and education.
Generation of High-Quality Images: Through objective evaluation and visual evaluation of the images, we clarified the quality characteristics of the generated images. The generated images are relatively close to real images.
Application to Other AI Tasks: We applied the generated images to image classification tasks, demonstrating that the use of generated images contributes to improved classification performance and showcasing a new application of text-to-image technology.
2. Materials and Methods
2.1. Outline
The outline of this study is shown in
Figure 1. In this study, we finetuned the Stable Diffusion (SD) model, a text-to-image technology, to generate cytological images from the corresponding image findings. The generated images underwent quantitative and visual evaluations. Furthermore, we applied the generated diverse cytological images to classification tasks to verify their effectiveness.
2.2. Dataset
For this study, we collected lung cancer cells from 135 patients through interventional cytology techniques, specifically bronchoscopy or computed-tomography-guided fine-needle aspiration. Among these, there were 83 cases of adenocarcinoma and 52 cases of squamous cell carcinoma. The final diagnosis was confirmed by combining the cytological findings with histological analysis of biopsy samples. The cytological samples were processed using liquid-based cytology via the BD SurePath™ Pap test (Becton Dickinson, Franklin Lakes, NJ, USA) and stained using the Papanicolaou method. A microscope (BX53, Olympus Corporation, Tokyo, Japan) equipped with a digital camera (DP20, Olympus Corporation) was used to capture 460 cell images in JPEG format at a resolution of 1280 × 960 pixels.
Figure 2 illustrates the creation of the image dataset for image generation. A cytotechnologist and a cytopathologist selected patch images measuring 296 × 296 pixels from the original microscopic images, focusing on areas containing cells. The resulting dataset comprised 192 images of adenocarcinoma and 280 images of squamous cell carcinoma. Image findings were prepared for these images, describing features such as cell type, nucleus morphology, cell arrangement, and background conditions unrelated to the target cells. These image findings were written by a cytotechnologist and a cytopathologist, following the World Health Organization (WHO, Geneva, Switzerland) guidelines for pulmonary cytopathology reporting [
14].
Figure 3 presents a sample from the dataset we made.
Finally, the datasets were randomly divided into training and evaluation datasets. The training dataset consisted of 151 adenocarcinoma images and 221 squamous cell carcinoma images, while the evaluation dataset comprised 41 adenocarcinoma images and 59 squamous cell carcinoma images. Images from the same patient were not included in both the training and evaluation datasets.
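As an illustration of this patient-level separation, the following minimal Python sketch (not the authors' code) uses scikit-learn's GroupShuffleSplit with hypothetical file names and patient IDs to keep all patches from one patient within a single subset.

```python
# Minimal sketch (not the authors' code): patient-wise splitting with scikit-learn so
# that patches from the same patient never appear in both training and evaluation sets.
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: one entry per patch image.
image_paths = ["pt01_a.jpg", "pt01_b.jpg", "pt02_a.jpg",
               "pt03_a.jpg", "pt03_b.jpg", "pt04_a.jpg"]
labels      = [0, 0, 1, 0, 0, 1]          # 0 = adenocarcinoma, 1 = squamous cell carcinoma
patient_ids = ["pt01", "pt01", "pt02", "pt03", "pt03", "pt04"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, eval_idx = next(splitter.split(image_paths, labels, groups=patient_ids))
print("training patches:  ", [image_paths[i] for i in train_idx])
print("evaluation patches:", [image_paths[i] for i in eval_idx])
```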
2.3. Text-to-Image Model
In this study, we used the SD model to generate cytological images. The SD model consists of a variational autoencoder (VAE) [
15], a diffusion model, and a text-conditioning mechanism. The VAE consists of an encoder and a decoder: the encoder compresses high-dimensional image data into a low-dimensional latent space, while the decoder restores the original image from that latent representation. The latent space, i.e., the values processed in the hidden layers of the VAE, captures the essential features of the images; once the model is trained, the decoder can generate various images from features supplied to the latent space.
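To illustrate the encoder-decoder idea, the following toy Keras sketch compresses a 296 × 296 patch into a small latent tensor and reconstructs it; it is a conceptual illustration only and omits the variational sampling and KL regularization of the actual SD autoencoder.

```python
# Conceptual sketch of the encoder/decoder idea behind the latent space (a toy
# convolutional autoencoder; the actual SD autoencoder is a KL-regularized VAE
# with a different architecture).
import tensorflow as tf
from tensorflow.keras import layers

# Encoder: compress a 296 x 296 x 3 patch into a low-dimensional latent tensor.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(296, 296, 3)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(4, 3, padding="same"),                      # latent representation
])

# Decoder: restore an image from the latent tensor.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=encoder.output_shape[1:]),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(3, 3, padding="same", activation="tanh"),   # pixel values in [-1, 1]
])

latent = encoder(tf.random.uniform((1, 296, 296, 3)))          # shape (1, 74, 74, 4)
reconstruction = decoder(latent)                               # shape (1, 296, 296, 3)
```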
The diffusion model is a generative AI model inspired by diffusion phenomena in physics. This model consists of a diffusion process and an inverse diffusion process. In the diffusion process, noise is gradually added to the original image until it becomes random noise. In the inverse diffusion process, noise is gradually removed from the random noise to reconstruct an image. The text provided to the SD model is encoded by CLIP [
10] and embedded into the diffusion model, allowing it to remove noise based on the input text.
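The forward (noising) process described above can be sketched as follows under the commonly used variance-preserving formulation; the linear noise schedule and latent size are illustrative assumptions, not the exact settings of the SD model used here.

```python
# Sketch of the forward (noising) diffusion step under the usual variance-preserving
# formulation: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
# The model is trained to predict `noise` from (x_t, t) and the encoded text.
import numpy as np

def add_noise(x0, t, betas):
    """Corrupt a clean (latent) image x0 to diffusion step t."""
    alpha_bar = np.cumprod(1.0 - betas)                # cumulative product of (1 - beta_t)
    noise = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = np.linspace(1e-4, 0.02, 1000)                  # illustrative linear noise schedule
x0 = np.random.uniform(-1, 1, size=(64, 64, 4))        # dummy latent tensor
x_noisy, eps = add_noise(x0, t=500, betas=betas)
```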
2.4. Training of Stable Diffusion
The SD model is trained on a large image-text dataset. However, since this pre-training data does not include microscopic images, the model must be finetuned on a dataset created specifically for this study. Several finetuning methods are available: the standard deep learning approach of directly adjusting the weights of the entire network; textual inversion, which learns new words by adjusting only the text-embedding component [
16]; hypernetworks, which add new layers to the SD model and train only those parts [
17]; and Low-Rank Adaptation (LoRA) [
18]. Since the cytological images targeted in this study were not included at all in the pre-training of the SD model, it was necessary to adjust the entire model. Therefore, among these finetuning methods, we adopted the most classical approach of finetuning the entire network.
In this study, we employed two versions of the SD model, v1 (SDv1) and v2 (SDv2), to compare their performance. Additionally, when generating images, a parameter called classifier-free guidance (CFG) was used. This parameter controls the degree to which the generated image reflects the input text and takes a value from 0 to 100. In this study, images were generated with varying CFG values during the experiments.
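The role of the CFG value can be illustrated with the standard classifier-free guidance combination of the conditional and unconditional noise predictions; the arrays below are dummy placeholders rather than outputs of the actual model.

```python
# Sketch of how the CFG value enters the denoising step: the conditional and
# unconditional noise predictions are combined with the guidance scale.
import numpy as np

def apply_cfg(eps_uncond, eps_text, guidance_scale):
    # Larger scales push the denoising direction further toward the text condition.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)

eps_uncond = np.random.randn(64, 64, 4)   # prediction without the findings text
eps_text   = np.random.randn(64, 64, 4)   # prediction conditioned on the findings text
guided     = apply_cfg(eps_uncond, eps_text, guidance_scale=20.0)
```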
For training, the learning rate was set to 1 × 10⁻⁶, the AdamW optimization algorithm was used, and training was conducted over 50 epochs. Input images were normalized so that pixel values ranged from −1 to 1. The training and inference programs were implemented in TensorFlow 2, and processing was carried out on a PC equipped with an NVIDIA RTX 6000 Ada GPU.
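As a minimal sketch of these settings (not the actual training script), the preprocessing to the [−1, 1] range and the AdamW optimizer could be written as follows; the 512 × 512 resize is an assumption about the SD input resolution.

```python
# Minimal sketch (not the actual training script) of the stated preprocessing and
# optimizer settings; the 512 x 512 resize is an assumption about the SD input size.
import tensorflow as tf

def preprocess(image):                                  # image: uint8 tensor, values 0..255
    image = tf.image.resize(tf.cast(image, tf.float32), (512, 512))
    return image / 127.5 - 1.0                          # scale pixel values to [-1, 1]

# AdamW with a learning rate of 1e-6 (built into recent TF 2 releases; older
# versions provide it via tensorflow_addons.optimizers.AdamW).
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-6)
```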
2.5. Objective Evaluation
To evaluate the quality of the generated images, we introduced both fundamental image quality characteristics and established evaluation metrics. For the former, the color images were converted into three components (hue, saturation, and value), and a histogram was calculated for each component. To compare the generated and real images comprehensively, the histograms calculated from multiple images were averaged within each evaluation group. Based on this, the histogram similarity between the generated and real images was calculated for each of the three components using the following correlation coefficient:

d(H_1, H_2) = \frac{\sum_i \left(H_1(i) - \bar{H}_1\right)\left(H_2(i) - \bar{H}_2\right)}{\sqrt{\sum_i \left(H_1(i) - \bar{H}_1\right)^2 \sum_i \left(H_2(i) - \bar{H}_2\right)^2}}

where H_1 and H_2 represent the averaged histograms of the generated images and the real images, respectively, and \bar{H}_1 and \bar{H}_2 are their mean bin values. For the latter, we adopted the Fréchet inception distance (FID) [
19] and kernel inception distance (KID) [
20]. These metrics measure the similarity between two groups of images based on feature vectors extracted from a pre-trained Inception model; smaller values indicate higher similarity between the groups.
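A minimal sketch of the histogram-based comparison is shown below, assuming that OpenCV's correlation measure (cv2.HISTCMP_CORREL) corresponds to the correlation coefficient above; the image arrays are random stand-ins for the real and generated groups.

```python
# Sketch of the histogram comparison: HSV histograms averaged within each group and
# compared per component with OpenCV's correlation measure.
import cv2
import numpy as np

def mean_hsv_histograms(images, bins=64):
    """Average the normalized H, S, V histograms over a list of BGR images."""
    sums = [np.zeros((bins, 1), np.float32) for _ in range(3)]
    for img in images:
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        for c in range(3):
            value_range = [0, 180] if c == 0 else [0, 256]   # hue spans 0..179 in OpenCV
            h = cv2.calcHist([hsv], [c], None, [bins], value_range)
            sums[c] += h / h.sum()
    return [s / len(images) for s in sums]

real_group = [np.random.randint(0, 256, (296, 296, 3), np.uint8) for _ in range(5)]
gen_group  = [np.random.randint(0, 256, (296, 296, 3), np.uint8) for _ in range(5)]
for name, h1, h2 in zip(["hue", "saturation", "value"],
                        mean_hsv_histograms(real_group),
                        mean_hsv_histograms(gen_group)):
    print(name, cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
```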
2.6. Visual Evaluation
To further evaluate the quality of the generated images, a visual assessment was conducted. The evaluation set comprised 20 adenocarcinoma images and 20 squamous cell carcinoma images from each of SDv1, SDv2, and the real images, for a total of 120 images. The generated images were obtained by providing the corresponding image findings to each SD model.
The visual quality assessment was carried out by two cytopathologists and two cytotechnologists. The evaluation comprised three items focused on cytology: authenticity of the cell nucleus (including external shape, nucleolus state, and chromatin appearance), cytoplasm quality, and cell arrangement. Each criterion was scored on a scale from 0 to 100, in increments of 10.
2.7. Cell Classification Using Generated Images
One of the applications of the images generated by the proposed method is their use in classification tasks. This study specifically focused on lung cancer cells, targeting adenocarcinoma and squamous cell carcinoma, which are often challenging to classify in cytology. The objective was to develop a deep learning model capable of distinguishing between these two cell types. The classification model was trained and evaluated using both real images and the generated images.
The classification models employed in this study include state-of-the-art architectures such as VGG-16 [
21], InceptionV3 [
22], ResNet50 [
23], DenseNet121 [
24], and Vision Transformer (Base model with 16 × 16 patches) [
25]. Each of these models was pre-trained on the ImageNet database. We replaced the original fully connected layer, which was designed for classification into 1000 categories, with a multi-layer perceptron consisting of a hidden layer with 1024 neurons and an output layer with two neurons. The entire network was then finetuned using the prepared image dataset. To assess the impact of using generated images in the training process, the data provision methods for the classification models were categorized into four types:
Training with Real Images Only: The model was trained solely using real images.
Training with Generated Images Only: The model was trained using only the generated images.
Training with Mixed Images: The model was trained using a mixture of both real and generated images.
2-STEP Training: The model was first trained solely on generated images and then underwent additional training using real images.
The fourth approach, 2-STEP Training, builds on our previous research [
5], which showed promising results. This method is particularly advantageous when generated images and real images are not completely equivalent. By initially training the model on generated images to establish a coarse classification, we can then finetune it using real images for final adjustments. This two-step training process enhances the model’s overall performance and improves classification accuracy.
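The head replacement and the 2-STEP strategy could look roughly as follows in TensorFlow 2; the pooling layer, dataset objects (generated_ds, real_ds, val_ds), and callback settings are illustrative assumptions rather than the exact configuration used in this study.

```python
# Rough sketch of the classifier head replacement and the 2-STEP strategy; the
# pooling layer, dataset objects, and callback settings are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(img_size=296):
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=(img_size, img_size, 3))
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(1024, activation="relu")(x)        # hidden layer of the new MLP head
    outputs = layers.Dense(2, activation="softmax")(x)  # adenocarcinoma vs. squamous cell carcinoma
    return tf.keras.Model(backbone.input, outputs)

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", restore_best_weights=True)

# STEP 1: coarse training on the SD-generated images.
# model.fit(generated_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])
# STEP 2: fine-tuning of the same weights on the real images.
# model.fit(real_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])
```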
For the training dataset, we utilized 151 images of adenocarcinoma and 221 images of squamous cell carcinoma. For evaluation, we employed 41 images of adenocarcinoma and 59 images of squamous cell carcinoma. Using the corresponding findings for a total of 371 training images, we generated 10 images per finding with SDv2, resulting in a total of 3710 generated images, which were used for training.
For the training parameters, we set the learning rate to 1 × 10⁻⁶ and employed the SGD optimization algorithm; input images were normalized so that pixel values ranged from 0 to 1. We randomly selected 10% of the training data for validation and applied early stopping based on the validation loss, with a maximum of 200 epochs. To account for variability during training, we performed training and evaluation five times, calculating the mean and standard deviation of the area under the ROC curve (AUC), sensitivity, specificity, balanced accuracy, and F1-score. The code for training and evaluation was implemented in TensorFlow 2, with processing conducted on a PC equipped with an NVIDIA RTX 6000 Ada GPU.
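For a single evaluation run, the reported metrics could be computed as in the following sketch with scikit-learn; the labels and predicted scores shown are hypothetical.

```python
# Sketch of computing the reported metrics for one evaluation run with scikit-learn;
# the labels and predicted scores are hypothetical.
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                       # ground-truth classes
y_score = np.array([0.2, 0.4, 0.9, 0.7, 0.3, 0.1, 0.8, 0.6])       # predicted P(class 1)
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC:", roc_auc_score(y_true, y_score))
print("Sensitivity:", tp / (tp + fn))                               # class 1 as positive
print("Specificity:", tn / (tn + fp))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```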
3. Results
First, generated images obtained by SDv1 and SDv2 are shown in
Figure 4. The outputs include cell images of adenocarcinoma and squamous cell carcinoma generated from the same findings with the CFG varied from 5 to 60. As a preliminary study, cytopathologists and cytotechnologists evaluated the images generated at five different CFG values to determine the preferred CFG value for each model. Based on their feedback, the median preferred CFG value was 10 for SDv1 and 20 for SDv2. Consequently, we set the CFG values to 10 and 20 for SDv1 and SDv2, respectively, in the subsequent evaluations.
Next, using the 3710 generated images, we calculated the histogram similarity of hue, saturation, and value, as well as the FID and KID, with the results shown in
Figure 5 and
Figure 6, respectively. The CFG was also varied when calculating these evaluation metrics.
For the visual evaluation, we present a box plot (
Figure 7) displaying the scores assigned by the four evaluators regarding the authenticity of the cell nucleus, cytoplasm, and cell arrangement. Additionally, the results of Welch's t-test, which assessed significant differences among the real images, SDv1, and SDv2 for each evaluator, are summarized in
Table 1. Furthermore,
Table 2 illustrates the outcomes of applying the images generated by the proposed method to the classification of cell images and evaluates the classification performance.
4. Discussion
In this study, we proposed a method to generate cell images from cytological findings using text-to-image technology, validating its effectiveness through both quantitative and visual evaluations. The images generated, as shown in
Figure 4, confirmed that faithful representations of cells were produced based on the input findings. Notably, the SDv1 model generated artificially enhanced images with more pronounced outlines and contrasts as the CFG increased. In contrast, SDv2 exhibited minimal changes in image quality across varying CFG levels. Visual assessments indicated that SDv1 produced the most realistic images at a CFG of 10, while SDv2 was deemed most realistic at a CFG of 20. Under these conditions, we evaluated three main components of the cell images: the nucleus, cytoplasm, and cell arrangement. The real images received the highest scores, followed by those from SDv2 and then SDv1. Interestingly, in several evaluation metrics performed by cytopathologists, no significant differences were noted between real and SDv2 images, suggesting that SDv2 achieved a high level of image quality. Conversely, results from cytotechnologists revealed significant differences between the real and generated images, regardless of whether they were produced by SDv1 or SDv2. This discrepancy may arise from the cytotechnologists’ extensive experience in screening a large number of cells, enabling them to distinguish more effectively between real and generated images. Furthermore, the cytoplasmic representation differed notably between the two models: SDv2, trained on a larger dataset, achieved a more natural cytoplasmic depiction with less emphasis on artificial outlines. This is believed to contribute to SDv2’s relatively high scores for cytoplasm. In contrast, for features such as the nucleus and cell arrangement, SDv1 and SDv2 exhibited comparable performance, as these components rely more on fundamental structural characteristics, which both models could represent adequately.
In terms of quantitative evaluation, SDv2 demonstrated superior characteristics compared to SDv1, as indicated by image quality metrics such as FID and KID, which are commonly used for assessing generative AI performance. Notably, the higher-quality SDv2 achieved optimal results for both metrics at a CFG of 20, which aligned with the visually assessed optimal conditions. Conversely, SDv1 exhibited a significant discrepancy between quantitative and visual evaluations. This inconsistency may stem from considerable differences in edges and color tones between the real images and those generated by SDv1, leading to an inaccurate overall similarity calculation. When examining basic image quality characteristics, such as hue, saturation, and brightness, SDv1 showed minimal changes with varying CFG levels. In contrast, for SDv2, it was observed that as CFG increased, the consistency of hue decreased while the consistency of brightness improved. This suggests that although CFG is a parameter controlling the fidelity of the generated images to the provided text, increasing the CFG may shift the model’s focus toward aligning the brightness distribution with that of the real images rather than maintaining consistency in hue.
The results in
Table 2 demonstrate that our proposed methods, “2-STEP training” and “Mixed”, consistently achieved higher AUC and F1 scores compared to traditional approaches such as “Real only” and “Generated only”. This improvement is particularly notable in models such as VGG16 and Inception V3, where “2-STEP training” yielded the highest performance metrics, indicating the effectiveness of our approach in leveraging both real and synthetic data. Specifically, “2-STEP training” allowed models to benefit from the unique features present in generated data while fine-tuning with real data for enhanced accuracy and robustness. Furthermore, the stability of our proposed methods is evidenced by the variability across repeated evaluations, which remained under 3% for most metrics, underscoring the reliability of these methods.
Among the classification models, the vision transformer showed the best classification performance. The vision transformer is an image classification model that divides an image into small patches and allows the transformer to analyze them, including the relationships among them. A cytological image contains many cells, and we believe the high classification accuracy was achieved by accurately capturing the relationships between these cells and the cells of interest. Furthermore, whereas many CNN models achieved their highest accuracy through two-step training, the vision transformer achieved its best accuracy when real and generated images were mixed. Vision transformers are reported to perform more advanced classification than CNNs owing to their attention mechanism, and we think this allows the vision transformer to achieve high performance in a single training session, even when images with slightly different characteristics are mixed together.
In our previous study, the correct classification rate for lung cancer histology was 71% [
3]; although that rate was obtained under different conditions and is not directly comparable, and given that the classification rate of cytopathologists is reported to be comparable to that of the previous study, the classification rate achieved in this study can be considered considerably high. Because a clear benefit from the generated images was also observed, this method is highly likely to contribute to improving the accuracy of cytological diagnosis. In clinical settings, distinguishing between adenocarcinoma and squamous cell carcinoma can often be challenging, with some cases requiring tissue or immunohistochemical samples that take time to prepare for a definitive diagnosis. If our proposed method can accurately classify these types, it could expedite diagnoses, enabling earlier treatment initiation. Additionally, due to a shortage of cytotechnologists in many hospitals, integrating this method as one part of the double-checking process could enhance diagnostic quality and streamline the workflow, potentially improving the efficiency and reliability of cytological assessments.
The above experimental results demonstrate that various realistic cytological images can be generated from image findings and that, although there are differences in image quality between generated and real images, training strategies that exploit the generated images can achieve better classification performance.
The method also has multiple potential applications. First, the generated images could be used as an educational tool in actual medical practice. In cytology education, it is difficult to prepare a large number of images of different cases and cell types. Using this technology, high-quality cell images based on a variety of cases can be easily generated to compensate for the lack of data in education. The generated images can play an important role in the training of cytotechnologists and cytopathologists, improving the accuracy of training in a field where the number of cases is limited.
The results of this study also demonstrate the effectiveness of data augmentation with generated images, which may contribute to solving data shortages, especially in the training of deep learning models. In medical imaging, where data collection is often difficult due to patient privacy and ethical issues, it is expected that generated images can be used to expand the dataset and improve the performance of AI models. In particular, the two-step training method implemented in this study has been shown to improve classification accuracy by performing coarse classification on the generated images, followed by finetuning on the real images. We believe that this method is applicable to other diagnostic domains and image analysis tasks. Furthermore, it also has potential applications beyond clinical diagnosis to educational settings, providing a valuable tool for teaching and developing diagnostic skills through the generation of diverse training datasets.
One limitation of this study is the variability observed between evaluators in subjective assessments. While the primary diagnostic components—such as the nucleus, cytoplasm, and cell arrangement—were evaluated based on previous methods [
26], allowing us to identify general trends, future efforts should focus on reducing observer variability and refining evaluation criteria to achieve more accurate assessments. Additionally, a detailed comparison with traditional generative methods, such as GANs, is necessary. In this study, the pre-trained image generation model was fine-tuned with a limited number of images, whereas GANs typically require training from scratch with a large dataset, making such direct fine-tuning challenging. To accurately compare the performance of the proposed method with that of GANs, a substantial amount of image data will be required, which remains a task for future work. Furthermore, this study focused on adenocarcinoma and squamous cell carcinoma of the lung, but the development of image generation and classification methods for cancers of other organs is also needed. Following these investigations, advancing clinical applications will require the development of software that can be used in clinical and educational settings.