1. Introduction
Lung cancer has one of the highest mortality rates among cancers and is the leading cause of cancer-related deaths in men [
1]. In the diagnosis of lung cancer, if radiological examinations such as chest X-rays or CT scans raise suspicion of lung cancer, a biopsy is performed to collect tissue or cells for pathological examination. In pathological examination, a histopathological diagnosis is made by observing the structure of tissues, and a cytological diagnosis is made by observing the morphology of cells. A final diagnosis is determined by integrating these results.
In cytological diagnosis, cytotechnologists screen cytology specimens, identify abnormal cells, and communicate the marked cells and the image findings (reports) related to the cytological images to the cytopathologist. The cytopathologist then closely examines these cells and, taking into account other test results, such as the histopathological diagnosis, makes the final diagnosis.
In these diagnoses, it is necessary to examine a large number of cells and determine malignancy or tissue type, requiring efficient diagnosis within a limited time. To support this, numerous image recognition technologies, including AI-based methods, have been developed [
2,
3,
4]. For example, Zhang et al. developed a method using a convolutional neural network (CNN) to classify benign and malignant cells in cervical cytology, achieving a classification accuracy of 98.3% [
2]. Furthermore, we proposed a method for distinguishing three lung cancer types in lung cytology using an original CNN model, obtaining an accuracy of 71% [
3]. Additionally, Kaneko et al. developed a classification method for urine cytology images using EfficientNet, which achieved a 95% classification rate [
4].
When developing these classification methods, sufficient performance cannot be achieved if the types and characteristics of the cells in the training data are biased. Therefore, it is crucial to prepare a large and diverse set of images. However, collecting enough data to achieve satisfactory performance is a challenging task.
If diverse data could be artificially generated, this challenge might be overcome. We have previously used generative adversarial networks (GANs) as an artificial image generation technique to produce cytological images and applied them to image classification tasks, generating large numbers of pseudo-benign and pseudo-malignant cytological images to pre-train image classification models and improve classification accuracy [
5]. In another of our previous studies, we utilized GANs to generate training data specifically for the classification of idiopathic interstitial pneumonias (IIPs), a rare disease, successfully enhancing the performance of the classification model through this approach [
6]. However, in our prior research using GANs, images were generated from random noise, making it difficult to obtain images with desired characteristics; the generated images depended on the characteristics and distribution of the original images and were essentially produced at random, so their characteristics could not be controlled.
In recent years, text-to-image technology, which generates images from textual instructions, has been developed. For example, Rombach et al. developed a method that provides textual information to a diffusion model [
7,
8], generating high-resolution images [
9]. This technology, known as Stable Diffusion, is widely used for creating photographs and illustrations. Similarly, Imagen, developed by Saharia et al. [
10], and DALL-E, developed by Ramesh et al. [
11], are also text-to-image models that utilize diffusion models.
In contrast to GANs, text-to-image technology can output images that match the specified content, helping to prevent biases in the characteristics of the generated images. These text-to-image technologies are also being explored for applications in medical image processing. For example, Kaleta et al. generated laparoscopic surgery images from text and applied them to segmentation tasks for objects within the images, achieving good performance [
12]. In the context of chest imaging, Chambon et al. generated X-ray images from text and applied them to detect abnormalities [
13].
However, to the best of our knowledge, there have been no reports on text-to-image research related to cytology. Therefore, in this study, we develop a method to generate lung cytological images from descriptive reports of imaging findings using text-to-image technology and demonstrate its effectiveness through objective evaluation, visual evaluation, and application to image classification tasks. The main contributions of this study are as follows:
A New Approach for Cytology: This is the first attempt to generate cytological images from descriptive reports of cytological findings, contributing to the development of AI for cytology and assisting cytotechnologists and cytopathologists in their diagnoses and education.
Generation of High-Quality Images: Through objective evaluation and visual evaluation of the images, we clarified the quality characteristics of the generated images. The generated images are relatively close to real images.
Application to Other AI Tasks: We applied the generated images to image classification tasks, demonstrating that the use of generated images contributes to improved classification performance and showcasing a new application of text-to-image technology.
2. Materials and Methods
2.1. Outline
The outline of this study is shown in
Figure 1. In this study, we finetuned the Stable Diffusion (SD) model, a text-to-image technology, to generate cytological images from the corresponding image findings. The generated images underwent quantitative and visual evaluations. Furthermore, we applied the generated diverse cytological images to classification tasks to verify their effectiveness.
2.2. Dataset
For this study, we collected lung cancer cells from 135 patients through interventional cytology techniques, specifically bronchoscopy or computed-tomography-guided fine-needle aspiration. Among these, there were 83 cases of adenocarcinoma and 52 cases of squamous cell carcinoma. The final diagnosis was confirmed by combining the cytological findings with histological analysis of biopsy samples. The cytological samples were processed using liquid-based cytology via the BD SurePath™ Pap test (Becton Dickinson, Franklin Lakes, NJ, USA) and stained using the Papanicolaou method. A microscope (BX53, Olympus Corporation, Tokyo, Japan) equipped with a digital camera (DP20, Olympus Corporation) was used to capture 460 cell images in JPEG format at a resolution of 1280 × 960 pixels.
Figure 2 illustrates the creation of the image dataset for image generation. A cytotechnologist and a cytopathologist selected patch images measuring 296 × 296 pixels from the original microscopic images, focusing on areas containing cells. The resulting dataset comprised 192 images of adenocarcinoma and 280 images of squamous cell carcinoma. Image findings were prepared for these images, describing features such as cell type, nucleus morphology, cell arrangement, and background conditions unrelated to the target cells. These image findings were written by a cytotechnologist and a cytopathologist, following the World Health Organization (WHO, Geneva, Switzerland) guidelines for pulmonary cytopathology reporting [
14].
Figure 3 presents a sample from the dataset we made.
Finally, the datasets were randomly divided into training and evaluation datasets. The training dataset consisted of 151 adenocarcinoma images and 221 squamous cell carcinoma images, while the evaluation dataset comprised 41 adenocarcinoma images and 59 squamous cell carcinoma images. Images from the same patient were not included in both the training and evaluation datasets.
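As an illustration of this patient-level separation, the following minimal Python sketch (not the authors' code) uses scikit-learn's GroupShuffleSplit with hypothetical file names and patient IDs to keep all patches from one patient within a single subset.

```python
# Minimal sketch (not the authors' code): patient-wise splitting with scikit-learn so
# that patches from the same patient never appear in both training and evaluation sets.
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: one entry per patch image.
image_paths = ["pt01_a.jpg", "pt01_b.jpg", "pt02_a.jpg",
               "pt03_a.jpg", "pt03_b.jpg", "pt04_a.jpg"]
labels      = [0, 0, 1, 0, 0, 1]          # 0 = adenocarcinoma, 1 = squamous cell carcinoma
patient_ids = ["pt01", "pt01", "pt02", "pt03", "pt03", "pt04"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, eval_idx = next(splitter.split(image_paths, labels, groups=patient_ids))
print("training patches:  ", [image_paths[i] for i in train_idx])
print("evaluation patches:", [image_paths[i] for i in eval_idx])
```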
2.3. Text-to-Image Model
In this study, we used the SD model to generate cytological images. The SD model consists of a variational autoencoder (VAE) [
15], a diffusion model, and a text-conditioning mechanism. The VAE consists of an encoder and a decoder: the encoder compresses high-dimensional image data into a low-dimensional latent space, while the decoder restores the original image from that latent representation. The latent space, i.e., the values processed in the hidden layers of the VAE, captures the essential features of the images; once the model is trained, the decoder can generate various images from features supplied to the latent space.
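To illustrate the encoder-decoder idea, the following toy Keras sketch compresses a 296 × 296 patch into a small latent tensor and reconstructs it; it is a conceptual illustration only and omits the variational sampling and KL regularization of the actual SD autoencoder.

```python
# Conceptual sketch of the encoder/decoder idea behind the latent space (a toy
# convolutional autoencoder; the actual SD autoencoder is a KL-regularized VAE
# with a different architecture).
import tensorflow as tf
from tensorflow.keras import layers

# Encoder: compress a 296 x 296 x 3 patch into a low-dimensional latent tensor.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(296, 296, 3)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(4, 3, padding="same"),                      # latent representation
])

# Decoder: restore an image from the latent tensor.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=encoder.output_shape[1:]),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(3, 3, padding="same", activation="tanh"),   # pixel values in [-1, 1]
])

latent = encoder(tf.random.uniform((1, 296, 296, 3)))          # shape (1, 74, 74, 4)
reconstruction = decoder(latent)                               # shape (1, 296, 296, 3)
```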
The diffusion model is a generative AI model inspired by diffusion phenomena in physics. This model consists of a diffusion process and an inverse diffusion process. In the diffusion process, noise is gradually added to the original image until it becomes random noise. In the inverse diffusion process, noise is gradually removed from the random noise to reconstruct an image. The text provided to the SD model is encoded by CLIP [
10] and embedded into the diffusion model, allowing it to remove noise based on the input text.
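The forward (noising) process described above can be sketched as follows under the commonly used variance-preserving formulation; the linear noise schedule and latent size are illustrative assumptions, not the exact settings of the SD model used here.

```python
# Sketch of the forward (noising) diffusion step under the usual variance-preserving
# formulation: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
# The model is trained to predict `noise` from (x_t, t) and the encoded text.
import numpy as np

def add_noise(x0, t, betas):
    """Corrupt a clean (latent) image x0 to diffusion step t."""
    alpha_bar = np.cumprod(1.0 - betas)                # cumulative product of (1 - beta_t)
    noise = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = np.linspace(1e-4, 0.02, 1000)                  # illustrative linear noise schedule
x0 = np.random.uniform(-1, 1, size=(64, 64, 4))        # dummy latent tensor
x_noisy, eps = add_noise(x0, t=500, betas=betas)
```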
2.4. Training of Stable Diffusion
The SD model is trained on a large image-text dataset. However, since this pre-training data does not include microscopic images, the model must be finetuned on a dataset created specifically for this study. Several finetuning methods are available: the standard deep learning approach of directly adjusting the weights of the entire network; textual inversion, which learns new words by adjusting only the text-embedding component [
16]; hypernetworks, which add new layers to the SD model and train only those parts [
17]; and Low-Rank Adaptation (LoRA) [
18]. Since the cytological images targeted in this study were not included at all in the pre-training of the SD model, it was necessary to adjust the entire model. Therefore, among these finetuning methods, we adopted the most classical approach of finetuning the entire network.
In this study, we employed two versions of the SD model, v1 (SDv1) and v2 (SDv2), to compare their performance. Additionally, when generating images, a parameter called classifier-free guidance (CFG) was used. This parameter controls the degree to which the generated image reflects the input text and takes a value from 0 to 100. In this study, images were generated with varying CFG values during the experiments.
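The role of the CFG value can be illustrated with the standard classifier-free guidance combination of the conditional and unconditional noise predictions; the arrays below are dummy placeholders rather than outputs of the actual model.

```python
# Sketch of how the CFG value enters the denoising step: the conditional and
# unconditional noise predictions are combined with the guidance scale.
import numpy as np

def apply_cfg(eps_uncond, eps_text, guidance_scale):
    # Larger scales push the denoising direction further toward the text condition.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)

eps_uncond = np.random.randn(64, 64, 4)   # prediction without the findings text
eps_text   = np.random.randn(64, 64, 4)   # prediction conditioned on the findings text
guided     = apply_cfg(eps_uncond, eps_text, guidance_scale=20.0)
```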
For training, the learning rate was set to 1 × 10⁻⁶, the AdamW optimization algorithm was used, and training was conducted over 50 epochs. Input images were normalized so that pixel values ranged from −1 to 1. The training and inference programs were implemented in TensorFlow 2, and processing was carried out on a PC equipped with an NVIDIA RTX 6000 Ada GPU.
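As a minimal sketch of these settings (not the actual training script), the preprocessing to the [−1, 1] range and the AdamW optimizer could be written as follows; the 512 × 512 resize is an assumption about the SD input resolution.

```python
# Minimal sketch (not the actual training script) of the stated preprocessing and
# optimizer settings; the 512 x 512 resize is an assumption about the SD input size.
import tensorflow as tf

def preprocess(image):                                  # image: uint8 tensor, values 0..255
    image = tf.image.resize(tf.cast(image, tf.float32), (512, 512))
    return image / 127.5 - 1.0                          # scale pixel values to [-1, 1]

# AdamW with a learning rate of 1e-6 (built into recent TF 2 releases; older
# versions provide it via tensorflow_addons.optimizers.AdamW).
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-6)
```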
2.5. Objective Evaluation
To evaluate the quality of the generated images, we introduced both fundamental image quality characteristics and established evaluation metrics. For the former, the color images were converted into three components (hue, saturation, and value), and a histogram was calculated for each component. To compare the generated and real images comprehensively, the histograms calculated from multiple images were averaged within each evaluation group. Based on this, the histogram similarity between the generated and real images was calculated for each of the three components using the following correlation coefficient:

d(H_1, H_2) = \frac{\sum_i \left(H_1(i) - \bar{H}_1\right)\left(H_2(i) - \bar{H}_2\right)}{\sqrt{\sum_i \left(H_1(i) - \bar{H}_1\right)^2 \sum_i \left(H_2(i) - \bar{H}_2\right)^2}}

where H_1 and H_2 represent the averaged histograms of the generated images and the real images, respectively, and \bar{H}_1 and \bar{H}_2 are their mean bin values. For the latter, we adopted the Fréchet inception distance (FID) [
19] and kernel inception distance (KID) [
20]. These metrics measure the similarity between two groups of images based on feature vectors extracted from a pre-trained Inception model; smaller values indicate higher similarity between the groups.
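A minimal sketch of the histogram-based comparison is shown below, assuming that OpenCV's correlation measure (cv2.HISTCMP_CORREL) corresponds to the correlation coefficient above; the image arrays are random stand-ins for the real and generated groups.

```python
# Sketch of the histogram comparison: HSV histograms averaged within each group and
# compared per component with OpenCV's correlation measure.
import cv2
import numpy as np

def mean_hsv_histograms(images, bins=64):
    """Average the normalized H, S, V histograms over a list of BGR images."""
    sums = [np.zeros((bins, 1), np.float32) for _ in range(3)]
    for img in images:
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        for c in range(3):
            value_range = [0, 180] if c == 0 else [0, 256]   # hue spans 0..179 in OpenCV
            h = cv2.calcHist([hsv], [c], None, [bins], value_range)
            sums[c] += h / h.sum()
    return [s / len(images) for s in sums]

real_group = [np.random.randint(0, 256, (296, 296, 3), np.uint8) for _ in range(5)]
gen_group  = [np.random.randint(0, 256, (296, 296, 3), np.uint8) for _ in range(5)]
for name, h1, h2 in zip(["hue", "saturation", "value"],
                        mean_hsv_histograms(real_group),
                        mean_hsv_histograms(gen_group)):
    print(name, cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
```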
2.6. Visual Evaluation
To further evaluate the quality of the generated images, a visual assessment was conducted. The evaluation set comprised 20 adenocarcinoma images and 20 squamous cell carcinoma images from each of SDv1, SDv2, and the real images, for a total of 120 images. The generated images were obtained by providing the corresponding image findings to each SD model.
The visual quality assessment was carried out by two cytopathologists and two cytotechnologists. The evaluation comprised three items focused on cytology: authenticity of the cell nucleus (including external shape, nucleolus state, and chromatin appearance), cytoplasm quality, and cell arrangement. Each criterion was scored on a scale from 0 to 100, in increments of 10.
2.7. Cell Classification Using Generated Images
One of the applications of the images generated by the proposed method is their use in classification tasks. This study specifically focused on lung cancer cells, targeting adenocarcinoma and squamous cell carcinoma, which are often challenging to classify in cytology. The objective was to develop a deep learning model capable of distinguishing between these two cell types. The classification model was trained and evaluated using both real images and the generated images.
The classification models employed in this study include state-of-the-art architectures such as VGG-16 [
21], InceptionV3 [
22], ResNet50 [
23], DenseNet121 [
24], and Vision Transformer (Base model with 16 × 16 patches) [
25]. Each of these models was pre-trained on the ImageNet database. We replaced the original fully connected layer, which was designed for classification into 1000 categories, with a multi-layer perceptron consisting of a hidden layer with 1024 neurons and an output layer with two neurons. The entire network was then finetuned using the prepared image dataset. To assess the impact of using generated images in the training process, the data provision methods for the classification models were categorized into four types:
Training with Real Images Only: The model was trained solely using real images.
Training with Generated Images Only: The model was trained using only the generated images.
Training with Mixed Images: The model was trained using a mixture of both real and generated images.
2-STEP Training: The model was first trained solely on generated images and then underwent additional training using real images.
The fourth approach, 2-STEP Training, builds on our previous research [
5], which showed promising results. This method is particularly advantageous when generated images and real images are not completely equivalent. By initially training the model on generated images to establish a coarse classification, we can then finetune it using real images for final adjustments. This two-step training process enhances the model’s overall performance and improves classification accuracy.
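The head replacement and the 2-STEP strategy could look roughly as follows in TensorFlow 2; the pooling layer, dataset objects (generated_ds, real_ds, val_ds), and callback settings are illustrative assumptions rather than the exact configuration used in this study.

```python
# Rough sketch of the classifier head replacement and the 2-STEP strategy; the
# pooling layer, dataset objects, and callback settings are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(img_size=296):
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=(img_size, img_size, 3))
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(1024, activation="relu")(x)        # hidden layer of the new MLP head
    outputs = layers.Dense(2, activation="softmax")(x)  # adenocarcinoma vs. squamous cell carcinoma
    return tf.keras.Model(backbone.input, outputs)

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", restore_best_weights=True)

# STEP 1: coarse training on the SD-generated images.
# model.fit(generated_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])
# STEP 2: fine-tuning of the same weights on the real images.
# model.fit(real_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])
```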
For the training dataset, we utilized 151 images of adenocarcinoma and 221 images of squamous cell carcinoma. For evaluation, we employed 41 images of adenocarcinoma and 59 images of squamous cell carcinoma. Using the corresponding findings for a total of 371 training images, we generated 10 images per finding with SDv2, resulting in a total of 3710 generated images, which were used for training.
For the training parameters, we set the learning rate to 1 × 10⁻⁶ and employed the SGD optimization algorithm; input images were normalized so that pixel values ranged from 0 to 1. We randomly selected 10% of the training data for validation and applied early stopping based on the validation loss, with a maximum of 200 epochs. To account for variability during training, we performed training and evaluation five times, calculating the mean and standard deviation of the area under the ROC curve (AUC), sensitivity, specificity, balanced accuracy, and F1-score. The code for training and evaluation was implemented in TensorFlow 2, with processing conducted on a PC equipped with an NVIDIA RTX 6000 Ada GPU.
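For a single evaluation run, the reported metrics could be computed as in the following sketch with scikit-learn; the labels and predicted scores shown are hypothetical.

```python
# Sketch of computing the reported metrics for one evaluation run with scikit-learn;
# the labels and predicted scores are hypothetical.
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                       # ground-truth classes
y_score = np.array([0.2, 0.4, 0.9, 0.7, 0.3, 0.1, 0.8, 0.6])       # predicted P(class 1)
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC:", roc_auc_score(y_true, y_score))
print("Sensitivity:", tp / (tp + fn))                               # class 1 as positive
print("Specificity:", tn / (tn + fp))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```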
3. Results
First, generated images obtained by SDv1 and SDv2 are shown in
Figure 4. The outputs include cell images of adenocarcinoma and squamous cell carcinoma generated from the same findings with the CFG varied from 5 to 60. As a preliminary study, cytopathologists and cytotechnologists evaluated the images generated at five different CFG values to determine the preferred CFG value for each model. Based on their feedback, the median preferred CFG value was 10 for SDv1 and 20 for SDv2. Consequently, we set the CFG values to 10 and 20 for SDv1 and SDv2, respectively, in the subsequent evaluations.
Next, using the 3710 generated images, we calculated the histogram similarity of hue, saturation, and value, as well as the FID and KID, with the results shown in
Figure 5 and
Figure 6, respectively. The CFG was also varied when calculating these evaluation metrics.
For the visual evaluation, we present a box plot (
Figure 7) displaying the scores assigned by the four evaluators regarding the authenticity of the cell nucleus, cytoplasm, and cell arrangement. Additionally, the results of Welch's t-test, which assessed significant differences among the real images, SDv1, and SDv2 for each evaluator, are summarized in
Table 1. Furthermore,
Table 2 illustrates the outcomes of applying the images generated by the proposed method to the classification of cell images and evaluates the classification performance.
4. Discussion
In this study, we proposed a method to generate cell images from cytological findings using text-to-image technology, validating its effectiveness through both quantitative and visual evaluations. The images generated, as shown in
Figure 4, confirmed that faithful representations of cells were produced based on the input findings. Notably, the SDv1 model generated artificially enhanced images with more pronounced outlines and contrasts as the CFG increased. In contrast, SDv2 exhibited minimal changes in image quality across varying CFG levels. Visual assessments indicated that SDv1 produced the most realistic images at a CFG of 10, while SDv2 was deemed most realistic at a CFG of 20. Under these conditions, we evaluated three main components of the cell images: the nucleus, cytoplasm, and cell arrangement. The real images received the highest scores, followed by those from SDv2 and then SDv1. Interestingly, in several evaluation metrics performed by cytopathologists, no significant differences were noted between real and SDv2 images, suggesting that SDv2 achieved a high level of image quality. Conversely, results from cytotechnologists revealed significant differences between the real and generated images, regardless of whether they were produced by SDv1 or SDv2. This discrepancy may arise from the cytotechnologists’ extensive experience in screening a large number of cells, enabling them to distinguish more effectively between real and generated images. Furthermore, the cytoplasmic representation differed notably between the two models: SDv2, trained on a larger dataset, achieved a more natural cytoplasmic depiction with less emphasis on artificial outlines. This is believed to contribute to SDv2’s relatively high scores for cytoplasm. In contrast, for features such as the nucleus and cell arrangement, SDv1 and SDv2 exhibited comparable performance, as these components rely more on fundamental structural characteristics, which both models could represent adequately.
In terms of quantitative evaluation, SDv2 demonstrated superior characteristics compared to SDv1, as indicated by image quality metrics such as FID and KID, which are commonly used for assessing generative AI performance. Notably, the higher-quality SDv2 achieved optimal results for both metrics at a CFG of 20, which aligned with the visually assessed optimal conditions. Conversely, SDv1 exhibited a significant discrepancy between quantitative and visual evaluations. This inconsistency may stem from considerable differences in edges and color tones between the real images and those generated by SDv1, leading to an inaccurate overall similarity calculation. When examining basic image quality characteristics, such as hue, saturation, and brightness, SDv1 showed minimal changes with varying CFG levels. In contrast, for SDv2, it was observed that as CFG increased, the consistency of hue decreased while the consistency of brightness improved. This suggests that although CFG is a parameter controlling the fidelity of the generated images to the provided text, increasing the CFG may shift the model’s focus toward aligning the brightness distribution with that of the real images rather than maintaining consistency in hue.
The results in
Table 2 demonstrate that our proposed methods, “2-STEP training” and “Mixed”, consistently achieved higher AUC and F1 scores compared to traditional approaches such as “Real only” and “Generated only”. This improvement is particularly notable in models such as VGG16 and Inception V3, where “2-STEP training” yielded the highest performance metrics, indicating the effectiveness of our approach in leveraging both real and synthetic data. Specifically, “2-STEP training” allowed models to benefit from the unique features present in generated data while fine-tuning with real data for enhanced accuracy and robustness. Furthermore, the stability of our proposed methods is evidenced by the variability across repeated evaluations, which remained under 3% for most metrics, underscoring the reliability of these methods.
Among the classification models, the vision transformer showed the best classification performance. The vision transformer is an image classification model that divides an image into small patches and allows the transformer to analyze them, including the relationships among them. A cytological image contains many cells, and we believe the high classification accuracy was achieved by accurately capturing the relationships between these cells and the cells of interest. Furthermore, whereas many CNN models achieved their highest accuracy through two-step training, the vision transformer achieved its best accuracy when real and generated images were mixed. Vision transformers are reported to perform more advanced classification than CNNs owing to their attention mechanism, and we think this allows the vision transformer to achieve high performance in a single training session, even when images with slightly different characteristics are mixed together.
In our previous study, the correct classification rate for lung cancer histology was 71% [
3]; although that rate was obtained under different conditions and is not directly comparable, and given that the classification rate of cytopathologists is reported to be comparable to that of the previous study, the classification rate achieved in this study can be considered considerably high. Because a clear benefit from the generated images was also observed, this method is highly likely to contribute to improving the accuracy of cytological diagnosis. In clinical settings, distinguishing between adenocarcinoma and squamous cell carcinoma can often be challenging, with some cases requiring tissue or immunohistochemical samples that take time to prepare for a definitive diagnosis. If our proposed method can accurately classify these types, it could expedite diagnoses, enabling earlier treatment initiation. Additionally, due to a shortage of cytotechnologists in many hospitals, integrating this method as one part of the double-checking process could enhance diagnostic quality and streamline the workflow, potentially improving the efficiency and reliability of cytological assessments.
The above experimental results demonstrate that various realistic cytological images can be generated from image findings and that, although there are differences in image quality between generated and real images, training strategies that exploit the generated images can achieve better classification performance.
The method also has multiple potential applications. First, the generated images could be used as an educational tool in actual medical practice. In cytology education, it is difficult to prepare a large number of images of different cases and cell types. Using this technology, high-quality cell images based on a variety of cases can be easily generated to compensate for the lack of data in education. The generated images can play an important role in the training of cytotechnologists and cytopathologists, improving the accuracy of training in a field where the number of cases is limited.
The results of this study also demonstrate the effectiveness of data augmentation with generated images, which may contribute to solving data shortages, especially in the training of deep learning models. In medical imaging, where data collection is often difficult due to patient privacy and ethical issues, it is expected that generated images can be used to expand the dataset and improve the performance of AI models. In particular, the two-step training method implemented in this study has been shown to improve classification accuracy by performing coarse classification on the generated images, followed by finetuning on the real images. We believe that this method is applicable to other diagnostic domains and image analysis tasks. Furthermore, it also has potential applications beyond clinical diagnosis to educational settings, providing a valuable tool for teaching and developing diagnostic skills through the generation of diverse training datasets.
One limitation of this study is the variability observed between evaluators in subjective assessments. While the primary diagnostic components—such as the nucleus, cytoplasm, and cell arrangement—were evaluated based on previous methods [
26], allowing us to identify general trends, future efforts should focus on reducing observer variability and refining evaluation criteria to achieve more accurate assessments. Additionally, a detailed comparison with traditional generative methods, such as GANs, is necessary. In this study, the pre-trained image generation model was fine-tuned with a limited number of images, whereas GANs typically require training from scratch with a large dataset, making such direct fine-tuning challenging. To accurately compare the performance of the proposed method with that of GANs, a substantial amount of image data will be required, which remains a task for future work. Furthermore, this study focused on adenocarcinoma and squamous cell carcinoma of the lung, but the development of image generation and classification methods for cancers of other organs is also needed. Following these investigations, advancing clinical applications will require the development of software that can be used in clinical and educational settings.