1. Introduction
With the improvement in medicine, medical image analysis has become an important auxiliary diagnosis and treatment method in medical imaging. The anatomical imaging of organs obtained through medical imaging technologies can objectively reflect tissue structure and pathological change. Medical image segmentation improves efficiency for doctors, from positioning and obtaining evidence to guiding treatment. Deep learning technology has made breakthrough research results in the field of computer vision and image processing. Its powerful feature learning ability has attracted widespread attention. The application of artificial intelligence to medical image processing can not only improve the processing efficiency, but also play an auxiliary role, with doctors subsequently analyzing the condition [
1]. As a key branch in medical imaging research, medical image semantic segmentation tasks have always played a pivotal role in clinical diagnosis [
2]. The mainstream method studies the segmentation of the lesion area in the frontal image. In this paper, we mainly study the segmentation of multi-position medical images.
Artificial intelligence is increasingly popular with the advancement of massive electronic data and improved technology. In the area of combining artificial intelligence and medical treatment, multiple auxiliary diagnosis systems based on convolutional neural networks have become an inevitable trend in developing new medical models [
3], such as the incorporation of positioning, medical image segmentation, and classification. To facilitate accurate and detailed observations of lesions, machine-learning-based automatic recognition and segmentation of medical images can enhance processing efficiency and provide supplementary assistance to doctors to subsequently analyze patients’ conditions [
4]. Due to the COVID-19 pandemic, there has been a massive research focus on lung field segmentation with chest X-ray or computed tomography (CT) images for auxiliary diagnosis. The CT images show clearer edges without overlapping organs compared to chest X-ray images. Chest X-ray images are low contrast and have blurred borders for organ segmentation, but chest X-ray images are in common use in hospitals because of their low price and quick image generation. There is a great need to process large amounts of information, which can be achieved through automatic segmentation. The combination of lung field and heart segmentation improves the efficiency, showing the relative positions of the organs and sizes of the lung fields and heart at the same time. The positions of the lung fields and heart also provides information on the potential risk of the heart and lung fields. On the one hand, lung fields and heart segmentation identifies the relative position and sizes of the lung fields and heart to clarify the illness. On the other hand, automatic segmentation improves the speed of doctors’ ability to deal with diseases. The automatic segmentation of chest X-ray images has improved the image review quality and the speed of disease diagnosis. Our interest is in helping patients to detect potential risk with chest X-ray images and improving the speed of doctors’ diagnoses. For these reasons, this paper focus on lung field and heart segmentation with chest X-ray images.
Medical image segmentation is a critical task in the field of medical imaging. It involves the identification and delineation of specific structures or regions of interest within medical images captured using techniques such as magnetic resonance imaging (MRI), CT, or ultrasound scans. The purpose of segmentation is to extract accurate boundaries or contours of anatomical structures, tumors, lesions, or other areas of clinical significance. Accurate segmentation plays a crucial role in various medical applications, including disease diagnosis, treatment planning, surgical guidance, and monitoring of disease progression. It allows clinicians to analyze and quantify specific regions, measure volumes, track changes over time, and assist in making informed decisions for patient care. However, medical image segmentation is a challenging task due to several factors. Firstly, medical images often exhibit low contrast, noise, and variations in intensity, making it difficult to distinguish between different structures or tissues. Additionally, the shapes and sizes of anatomical structures can vary significantly across patients and even within the same patient, further complicating the segmentation process. To address these challenges, various segmentation techniques have been developed, ranging from traditional methods to more advanced deep learning approaches. Traditional techniques include threshold, region-based methods, active contour models, and graph cuts. These methods rely on image characteristics, statistical information, or prior knowledge to delineate structures.
GANs are capable of generating synthetic data that closely resemble real data, making them useful in various applications such as image synthesis, video generation, and text generation. GANs can learn from unlabeled data, which eliminates the need for manually labeled training examples. This makes GANs flexible and adaptable to a wide range of domains and datasets. In novel data creation, GANs have the potential to generate entirely new and unseen data samples. This can be valuable for creative tasks, generating unique artwork or exploring uncharted areas of data distribution. GANs can be used to augment existing datasets to increase their size and diversity. This helps improve the generalization and performance of machine learning models trained on limited data. GANs can be employed for domain adaptation, where they learn to generate data from a source domain to match the statistics of a target domain. This facilitates transferring knowledge learned from one domain to another.
GAN training can be challenging and unstable at times. It involves a delicate balance between the generator and discriminator networks and finding this equilibrium can be difficult. GANs may suffer from issues like mode collapse, where the generator fails to explore the entire data space, or vanishing gradients. GANs are prone to skipping modes or failing to capture the complete data distribution. This means that certain aspects or modes of the data may not be adequately represented by the generated samples. Assessing the quality of GAN-generated samples objectively is still an open research problem. While subjective evaluation by human observers is often used, developing reliable quantitative metrics for GAN evaluation remains a challenge. Training GANs can be computationally demanding and time consuming, requiring powerful hardware resources like GPUs (graphics processing units) and extensive training iterations to achieve desirable results. The realistic nature of GAN-generated data raises ethical concerns about potential misuse, such as deep-fake technology or generating deceptive content. Ensuring the responsible use of GANs is an ongoing area of concern and research.
In recent years, deep learning approaches, particularly convolutional neural networks (CNNs), have shown remarkable success in medical image segmentation. CNN-based architectures, such as U-Net, SegNet, and DeepLab, have demonstrated high accuracy and efficiency in segmenting diverse medical structures with minimal manual intervention. These models can learn hierarchical features directly from the images and capture complex patterns for accurate segmentation. Supervised classifier learning is also employed. Refs. [
5,
6] introduced the concept of generative adversarial networks, with impressive results on image-generation-quality benchmarks. This approach involves the interplay between a generator and a discriminator, enabling comprehensive training of the discriminator to achieve autonomous segmentation [
7]. Furthermore, deep learning techniques incorporating multi-modal fusion are extensively employed in medical imaging for accurate medical image segmentation [
8]. In the book
Intelligent Data Analysis for Biomedical Application [
9], the authors [
10] utilized machine learning to successfully classify myocardial ischemia using delayed contrast enhancement techniques of magnetic resonance imaging. Additionally, addressing the domain shift problem, a new unsupervised domain adaptive framework called the collaborative image and feature adaptive algorithm (SIFA) [
11] was proposed and found to be effective. In another study, Ref. [
12] directly extracted features from the frequency data of vibration signals and evaluated the performance of feature learning from the original data, and spectrum and time–frequency combined data. This approach successfully applied deep learning in feature extraction for machine-based diagnosis. Furthermore, Ref. [
13] introduced a graph convolution method that utilized multi-resolution pools to capture local and contextual features. Their method enabled the learning of spatially related features in irregular domains within a multi-resolution space. A graph-based convolution method employing product neural networks was proposed for position and direction classification, resulting in improved pose parameter estimation and segmentation performance. Another innovative approach, presented by [
14], is the multi-receiving domain CNN (MRFNet) method. MRFNet employs an encoder–decoder module (EDM) with sub-net, providing optimal receiving fields for each sub-net and generating context information at the functional map level. MRFNet exhibits exceptional performance across all three medical image datasets.
Edge accuracy is often a key aspect and challenge in segmentation [
15]. One proposal suggests utilizing a parameter model with correlation probability density to describe the integration method in the largest posterior form [
16]. In 2019, Hiroki Tsud introduced a method that utilizes generative adversarial networks (GANs) with multiple functions for cell image segmentation [
17]. This method demonstrates improved segmentation accuracy compared to the traditional pix2pix approach [
18]. The field of medical imaging technology generates vast amounts of data. To address this, a heterogeneous framework for multi-core learning based on support vector machines (SVMs) was proposed by [
19]. They investigated the flexibility of this method in comparison to using SVMs and other classifiers to process single features which can enhance the learning ability of the Math Kernel Library (MKL) algorithm. Additionally, Nilanjan Dey proposed several medical applications based on meta-heuristics for segmentation [
20]. CT images are commonly employed not only for analyzing X-ray films but also for examining the internal structure of the heart, necessitating segmentation of the heart region itself [
21,
22]. Due to the complexity and significance of large blood vessels attached around the atria and ventricles, a team led by Lohendran Baskaran devised a method for multi-organ segmentation from coronary computed tomography angiography images [
23].
In contrast, Olivier Ecabert presented a model that encompasses four heart chambers and interconnected large blood vessels [
24]. Avi Ben-cohen introduced a method utilizing a fully convolutional network (FCN) with global context and local plaque level analysis based on super-pixel sparse classification for the detection of liver metastases [
25]. Addressing the complementary nature of salient edge and object information, Ref. [
26] proposed an edge navigation network that employs a three-step approach to simultaneously incorporate and model these two complementary pieces of information within a single network. The experimental results demonstrate improved performance, particularly in scenarios involving rough object boundaries. Additionally, Ref. [
27] proposed a multi-layer densely connected super-resolution 3D network with training guided by generative adversarial networks (GANs). This approach enables fast training and reasoning that outperforms other popular deep learning methods by achieving four times higher image resolution restoration while running six times faster. Furthermore, a novel system has been developed that utilizes CT scans to generate positron emission tomography (PET) virtual images. This system offers the potential for cardiac sarcoidosis evaluation [
28]. Ref. [
29] proposed multiple tasks deep learning model for detection of peripherally inserted central catheter (PICC) which aids in accurate identification and placement of catheters for medical procedures. Zhongrong Wang proposed pixel-wise weighting-based fully convolutional neural networks for left ventricle segmentation in short-axis MRI [
30]. LF-SegNet [
31] is a fully convolutional encoder-decoder network designed for lung fields segmentayion from chest radiographs images which assists in automated analysis and diagnosis of respiratory conditions for enabling more efficient medical interventions.
While significant progress has been made in organ segmentation using deep learning methods, there are still several gaps and challenges that exist. The availability of large-scale annotated datasets for organ segmentation is limited. Creating accurate and comprehensive annotations requires significant time and expertise, leading to a scarcity of labeled data. This constraint hinders the development and evaluation of robust models. The organs can exhibit significant variations in shape, size, and appearance across individuals and even within the same individual due to factors such as pathology or imaging artifacts. Existing algorithms often struggle to handle this variability, leading to sub-optimal segmentation results. Certain organs may have indistinct boundaries or overlap with neighboring structures, making their precise delineation challenging. Algorithms need to effectively handle these ambiguous cases and accurately differentiate organ boundaries from surrounding tissues. Deep learning methods primarily rely on data-driven learning without explicitly incorporating prior anatomical knowledge. Integrating prior knowledge, such as anatomical atlases or spatial constraints, into the segmentation process could improve accuracy and consistency.
The major contributions of this paper are outlined as follows.
A designed U-Net with an SE block called AR-UNet, which takes advantages of the attention mechanism to enhance the segmentation accuracy of the generator so as to improve the performance.
By applying the AR-UNet as the generator of the MWG-UNet structure, the proposed model can both promote accurate lung field segmentation and enhance the stability in model training.
The proposed MWG-UNet is comprehensively evaluated on the JRST and Shenzhen Hospital datasets and achieves the optimal performance for most evaluation metrics except IoU.
The remainder of this paper is organized as follows.
Section 2 introduces the proposed method in detail.
Section 3 presents the experimental results.
Section 4 presents a discussion of our method and others. In
Section 5, conclusions are drawn.
2. Materials and Methods
The methodology includes three parts: the flaws of GANs, Wasserstein GAN, and the improvement of WGAN. The flaws of GANs introduces the shortcomings of GANs from the perspective of mathematical principles. Wasserstein GAN introduces the working principles of WGAN and the improvement compared with GAN. The improvement of WGAN introduces the improvements that have been made based on WGAN.
2.1. GANs
Generative adversarial networks (GANs) are generative models and their training is in the form of a confrontational game. Equation (
1) is the objective loss function of the confrontation.
where
is the true sample distribution, and
is the sample distribution of the generator.
Equation (
2) represents the simplified optimal discriminator. It is derived by setting the derivative of Equation (
1) to 0 and expresses the loss function that captures the relative proportion of the true distribution and the probability of generating the distribution.
One challenge with GAN training is the need to avoid over-training the discriminator. If the discriminator becomes too proficient, it hampers the reduction in the generator’s loss function during the experimental phase. Another issue encountered in GANs is the limited diversity of the generated samples. In scenarios where the true sample distribution and the generator’s sample distribution exist as low-dimensional manifolds within a high-dimensional space, the probability of their overlap being negligible approaches 1. Consequently, regardless of how dissimilar they are, the JS divergence remains constant, resulting in the generator’s gradient approaching (approximately) 0 and effectively disappearing.
In summary, GANs encounter challenges related to the discriminator’s proficiency and the lack of diversity in generated outputs. Mitigating these challenges is crucial to achieving better performance and diversity in GAN-based architecture. The Wasserstein distance is defined as follows:
There are two significant issues with this equivalent minimization objective. Firstly, it aims to minimize the KL divergence between the true distribution and the generator’s distribution, while simultaneously maximizing the JS divergence between them. This approach is flawed, since the Kullback–Leibler (KL) divergence is not a symmetrical measure, leading to intuitive absurdity and numerical gradient instability.
Furthermore, the generator tends to prioritize generating repetitive and safe samples over diverse samples, exacerbating the challenges within GAN frameworks. These fundamental problems can be attributed to the unreasonable nature of the equivalent optimization distance metric and the generator’s loss function, which result in unstable gradients and imbalanced penalties between diversity and accuracy.
To address the latter concern, a proposed solution involves introducing noise to both the generated and real samples. Intuitively, this noise “diffuses” the original low-dimensional manifolds into the entire high-dimensional space, compelling them to have a noticeable overlap. Once an overlap exists, enabling the presence of a gradient, the JS divergence can effectively operate. Notably, as the two distributions become closer, the diffusion of overlap reduces the JS divergence. However, despite these improvements the quest for a quantitative indicator to measure training progress remains unresolved.
2.2. Wasserstein GAN
The Wasserstein distance, introduced in WGAN, addresses the problem of gradient disappearance in theory. Unlike KL divergence and JS divergence, which exhibit abrupt changes and can only be the largest or smallest values, the Wasserstein distance maintains smoothness. When researchers aim to optimize parameters using the gradient descent method, KL and JS divergences fail to provide gradients altogether, while the Wasserstein distance remains capable of providing a gradient.
Similarly, in a high-dimensional space, if two distributions lack overlap or have a negligible overlap, KL and JS divergences cannot accurately represent the distance or offer gradients. In contrast, the Wasserstein distance can provide meaningful gradients in such scenarios. The key advantage of the Wasserstein distance over KL divergence and JS divergence is its ability to reflect the distance between two distributions even when there is no overlap.
By utilizing the Wasserstein distance instead of JS divergence, stable training and progress indicators can be simultaneously achieved. The Wasserstein distance is defined as follows:
Formula (4) represents the marginal distribution of each distribution for the true sample distribution and the sample distribution of the generator. For each possible joint distribution
, the expected value of the distance of the sample under the joint distribution
can be calculated. The Wasserstein distance cannot be drawn directly and the loss function becomes the following formula:
When the Lipschitz continuity condition limits the maximum local variation in a continuous function, a parameter can be constructed. When the last layer of the discriminator network is not a non-linear activation layer, the loss function will approximate the distance between the true distribution and the generated distribution. Due to the excellent nature of the Wasserstein distance, there will be no problem with the disappearance of the generator gradient. The loss functions of the generator and discriminator are as follows:
According to Equation (
6), the smaller the value the better the training. After the improvement of WGAN, there are several improvements compared to the original GAN algorithm:
The sigmoid function is not applicable in the discriminator. GAN’s discriminator performs a two-classification task of true and false. The discriminator in WGAN approximates the Wasserstein distance. So there is no need for the sigmoid function.
The loss of the generator and discriminator does not take the log function.
It limits all parameters of the neural network to no more than a certain range.
It completely solves the problem of GAN training instability; there is no longer a need to carefully balance the training level of the generator and the discriminator.
During training there is a value like cross-entropy to indicate the progress of the training.
2.3. The Overall Framework of MWG-UNet
The basic framework of our proposed method, called MWG-UNet, is shown in
Figure 1. The architecture of MWG-UNet contains a discriminator for distinguishing real and fake data and a generator for making the distribution of generated samples close to the real data. In the generator of MWG-UNet, we use an improved U-Net called ARU-Net, with a squeeze and excitation (SE) block. The generator is responsible for creating synthetic data samples and attempting to generate realistic outputs that can deceive the discriminator. The discriminator is tasked with distinguishing between the real and fake data samples produced by the generator. The results of the generator and the real data are input into the discriminator to distinguish the real data from the fake data to finally realize accurate segmentation. The parameter from the discriminator updates the generator to improve the performance of the generator.
Figure 2 shows the details of the designed AR-UNet as the generator of MWG-UNet. AR-UNet is U-Net transformed with an SE block and residual connection. The structure of AR-UNet comprises an encoder and decoder. The encoder captures the high-level features of an input image and reduces its spatial dimensions, while the decoder recovers the spatial information and generates a segmentation mask by upsampling the encoded features to match the original input resolution. For every step of downsampling, we take 3 × 3 convolution layers to extract the features and the SE block for boosting the discriminated power of each channel, improving both the accuracy and efficiency in various computer vision tasks. After that, we use a 2 × 2 max pooling operation to extract high-dimensional features. On the contrary, for upsampling we use a 2 × 2 upsampling operation to enlarge the feature image. Then, we use the same method as every step of downsampling before max pooling for image processing. There are four steps for the upsampling operation and downsampling operation. In order to map the feature, we use a 1 × 1 convolution layer at the last layers. The skip connection enables the direct flow of low-level spatial information from the encoder to the decoder for facilitating precise localization and fine-grained details in the output and helps mitigate the problem of information loss during max pooling, improving the overall segmentation accuracy of U-Net. For the sake of improving the efficacy in medical image segmentation to obtain spatial features, we take non-linear addition at the final step with a 1 × 1 convolution layer for the output.
The discriminator of MWG-UNet is shown in
Figure 3. In every step of the discriminator, after every two convolutions, the downsampling operator is implemented. After four steps of the max pooling operation, the final output is given by the fully connected layer. Through a series of convolutional or fully connected layers, the discriminator learns to discern subtle patterns and features. The discriminator guides the generator’s learning process by providing feedback on how well it can deceive the discriminator. In this study, a simple discriminator is used for accurate segmentation and saving unnecessary computation.
Unstable training is a common issue with GANs. Although WGAN has made significant strides in achieving stability, there are instances where it generates poor samples or struggles to converge. The introduction of Wasserstein GAN shifts the measurement of probability distributions in GANs from f-divergence to Wasserstein distance, resulting in improved stability during training and generally higher-quality generated outputs. However, WGAN relies on a weight clipping strategy to enforce the Lipschitz constraint on the critic, which can lead to undesired behavior during the training process. To address this limitation, this paper proposes a different truncation pruning strategy known as gradient penalty. This strategy penalizes the gradient’s norm with respect to the critic’s input. By incorporating gradient penalty, the training of WGAN becomes more stable, and the quality of the the generated images improves.
In the realm of medical image segmentation, U-Net and transformed U-Net are commonly used methods. Our approach aims to enhance the performance and robustness by combining the concepts of U-Net and adversarial networks. The paper introduces the addition of Gaussian noise to the generated images and utilizes batch normalization in the discriminator to achieve higher-quality generated results with improved stability. We use the structure of Wasserstein GAN for the adversarial network. In the generator, we use the UNet structure for feature extraction. UNet, as the generator, has an encoding path for context and extracting features and a decoding path for upsampling to recover the spatial information lost during the encoding phase. The combination of WGAN and UNet improves the performance on lung segmentation with regards to the Jaccard similarity and Dice similarity metrics. Details on the value of the combination are shown in
Section 4.
This paper proposes a new architecture called the multi-tasking Wasserstein generative adversarial network U-shape network (MWG-UNet). Multi-tasking refers to the ability of a system or an individual to perform multiple tasks simultaneously or in rapid succession. The goal of multi-task learning is to improve the generalization and performance on each individual task by jointly optimizing the model’s parameters across multiple tasks. In the context of computer systems and artificial intelligence, multi-tasking refers to the capability of a program or a model to handle and execute multiple tasks concurrently. For chest X-ray medical images, multi-tasking improves the efficiency of diagnosis with the relative positions and the shapes of the heart and lung fields. The combination of WGAN and UNet improves the performance of lung segmentation with regards to Jaccard similarity and Dice similarity.
4. Discussion
In the era of big data, artificial intelligence is touching all fields, including medical image processing. Deep learning methods for automatic organ segmentation improve the efficiency of diagnosis. However, organ segmentation presents challenges due to irregular shapes, occlusions, image artifacts, and the limited availability of annotated data. Developing robust and efficient algorithms capable of handling these challenges is crucial for enhancing clinical decision making and improving patient care outcomes. Several deep learning methods have been developed for organ segmentation in medical imaging. U-Net is a popular architecture known for its encoder–decoder structure and skip connections, enabling precise localization. Mask R-CNN combines object detection with instance segmentation, providing accurate organ delineation. FCN utilizes fully convolutional layers to generate pixel-wise predictions. V-Net extends U-Net with 3D convolutions, suitable for volumetric data segmentation. DenseNet employs dense connectivity patterns to enhance feature reuse and gradient flow. Attention-based models like DeepLab utilize spatial attention mechanisms to focus on relevant regions. These methods leverage the power of deep neural networks in capturing intricate patterns and have shown promising results in organ segmentation tasks.
In this paper, we proposed a new method, called MWG-UNet, for medical image segmentation. The proposed method uses the designed AR-UNet as the generator and several convolutional and fully connected layers as the discriminator. AR-UNet combines the U-Net architecture for semantic segmentation with an SE block that captures channel-wise dependencies. The SE block re-calibrates feature maps, enhancing the discriminative power. This integration improves both localization accuracy and feature representation in U-Net, making it more effective for organ segmentation tasks. MWG-UNet with U-Net combines the WGAN framework for stable training of GANs with the AR-UNet architecture for semantic segmentation. This integration enables the generation of realistic and high-quality segmentation masks by leveraging the benefits of both an improved loss function and the ability to capture spatial information and preserve fine details.
The training data used were from JSRT and Shenzhen Hospital, with 862 images. The medical images from Shenzhen No.3 Hospital in Shenzhen provide 340 normal X-ray images and 275 abnormal X-ray images. JSRT encompasses 154 nodule and 93 non-nodule 12-bit gray-scale images with high resolution 2048 × 2048. To handle the limited dataset, we use geometric transformations and rotation as data augmentation to increase the data to avoid gradient explosion. The evaluation metrics are Dice similarity, IoU, recall, precision, and F1 score. As mentioned above, our results for the Dice similarity, precision, and F1 score of the proposed method outperform other models, reaching 95.28%, 96.41%, and 95.90% and the specificity surpasses the sub-optimal models by 0.28%, 0.90%, 0.24%, and 0.90%. The result of the IoU is slightly lower than the optimal method. The IoU demonstrates a worse performance in training. In further research, improving the ability of generalization is an important step to upgrade our algorithm. Our lung field and heart segmentation results have Dice similarity and IoU values of 71.16% and 74.56% for the heart segmentation. The segmentation results of the lung fields have Dice similarity and IoU values of 85.18% and 81.36%.
Based on the above discussion, it is shown that the proposed MWG-UNet is a lung field and heart segmentation model which takes advantage of the attention mechanism to enhance the segmentation accuracy of the generator so as to improve the performance. Although the proposed method presents sub-optimal results on the IoU value, there is still space for further improvement. Firstly, our novel method uses WGAN and AR-UNet for lung segmentation in chest X-ray images and outperforms other methods. Secondly, we proposed an improved U-Net with an SE block which is responsible for creating synthetic data samples and attempting to generate realistic outputs. Lastly, we use multiple task learning for multiple-organ segmentation to improve the efficiency of diagnosis with the relative positions and sizes of the organs.