1. Introduction
Breast cancer is one of the leading causes of death in women around the world, and diagnosing breast cancer in its early stages will always remain crucial [1,2]. Breast ultrasound is widely used in clinical diagnosis for its safety and low cost. Generally, accurate tumor segmentation is necessary and significant for precise diagnosis using breast ultrasound. However, it is difficult for fully automatic segmentation methods to obtain accurate results that meet clinical analysis standards [3]. This is mainly related to the poor quality of ultrasound images, but also to the limitations of the segmentation model. Compared to automatic segmentation methods that give results at once, the advantage of interactive segmentation is that the user provides prior knowledge about the object through interactions to guide the refinement of the segmentation result [4]. In a real clinical situation, each patient may have multiple ultrasound images, and it is unrealistic to manually annotate tumor boundaries for all of them. Therefore, interactive segmentation tools that quickly deliver highly accurate segmentation are of significant value for clinical use.
There are three key points for excellent interactive segmentation of medical images: a simple type of interaction, efficient interaction information transfer, and the use of prior knowledge. Existing interactive segmentation methods have different types of interactions, which can be divided into providing points [5,6,7,8,9], scribbles [10,11,12,13,14,15,16], a bounding box (BB), or a polygon box (PB) [11,17]. Among these, providing scribbles, a BB, or a PB requires the user to swipe the mouse pointer over the image for a long time, while clicking on points is intuitively the easiest interaction type. Interaction information transfer refers to the way user interactions are used to guide the segmentation. Most existing interactive segmentation methods use Gaussian probability maps, distance transforms, etc., to transfer user interaction information to the segmentation model. However, they cannot utilize both the location information of interaction points and the contextual information of the image. Since human interaction actually provides prior information to the network, using prior information in the model can reduce the number of user interactions, but few methods take advantage of this.
Some conventional interactive segmentation methods are based on graph theory. The graph cuts method [10] uses the Gaussian mixture model (GMM) as the underpinning model and needs the user's scribbles for refinement. In this method, a large number of scribbles are needed before a satisfactory accuracy is reached. GrabCut [11] requires the user to provide a bounding box to limit the region of interest (ROI) and thus needs fewer scribbles, but its performance on medical images is as poor as that of graph cuts [10] because it relies on the same GMM. The random walker segmentation method [12] uses a random walker as the basic model to attain a refined result. These three methods all require many user interactions due to the poor performance of their underpinning models. In 2007, Bai et al. proposed an interactive framework [13] using geodesic distances to convert user-provided scribbles, so that the target could be segmented automatically. This was the first method to use the geodesic distance transform for interactive segmentation, and several subsequent methods [14,15,16] have improved on it. However, all of them only perform well on images with large differences between foreground and background, because the geodesic distance focuses on the gradient information of the original image.
In order to break through the limitations of traditional methods, interactive segmentation methods based on CNNs have been proposed. Xu et al. [5] converted the user's interaction points into Euclidean distance maps based on foreground and background points. The five-channel image (the original RGB channels and two distance transform maps) was used as the input of a fully convolutional network (FCN) to obtain the segmentation result. The Euclidean distance transform is concerned with the location information of interaction points and cannot utilize image context information. BIFSeg [6] uses an image segmentation method similar to GrabCut [11]: the user first draws a bounding box as the input for a CNN to obtain an initial result; then, image-specific fine-tuning guides the CNN to improve the segmentation result. DeepIGeoS [7] was the first to propose using geodesic distance maps as part of the input for CNNs. Geodesic distance maps can reflect the grayscale texture information of the original image by calculating the shortest distance from every pixel to a specific point, so that the CNN can identify mis-segmentations of foreground and background from the input data and refine the segmentation result. However, the geodesic distance is sensitive to the contrast and spatial information of the image, and it fails to clearly indicate the location information of the interaction points. For example, in the case of a blurred tumor boundary, the geodesic distance near the boundary does not change significantly because the image gray gradient changes little, while the Euclidean distance is only related to the locations of interactions and is not influenced by the quality of the original image. This means that the Euclidean distance is more effective than the geodesic distance in pointing out the mis-segmented area. Therefore, there is an urgent need for a method that combines the advantages of both distance transforms.
In this paper, we proposed a one-stage interactive segmentation framework for breast ultrasound images based on the above three key points. Compared with existing two-stage interactive segmentation networks [6,7], which refine the result of an automatic segmentation network, our method has several advantages. First, our method can use the same CNN (I-net) to obtain the automatic segmentation and the refined result in turn. We trained I-net on the automatic segmentation task to ensure that it could provide an initial segmentation when only the original image is input. Second, our method has more effective interaction information transfer. We proposed a weighted distance transform combining the geodesic and Euclidean distance transforms, so that the distance map can reflect both the texture information near the object area and the location information of the interaction points in the whole image. Third, our method can reduce the number of interactions by using prior information in the training phase. We referred to the proposed framework as interactive segmentation using the weighted distance transform (WDTISeg).
The main contributions of the proposed method are as follows:
(1) We proposed a one-stage interactive segmentation framework for breast ultrasound image segmentation, which is the first method to use a single network to obtain both automatic and interactive segmentation. The training process was greatly simplified because no additional automatic segmentation network was required to provide the initial results;
(2) We proposed converting user interactions into maps with a weighted distance transform that combines the geodesic and Euclidean distance transforms. This combination can effectively convey the location information of interaction points while exploiting image context information;
(3) We proposed a shape-aware compound loss function that uses the prior knowledge of breast tumors in the training phase to reduce the number of interactions. The compound loss function improved the accuracy of model segmentation while avoiding oscillation and overfitting in the training process.
3. Experimental Results and Discussion
3.1. Setting
A dataset of 2200 breast ultrasound images was acquired at Fudan University Shanghai Cancer Center, Shanghai, China from January 2019 to December 2019. The equipment used to obtain the ultrasound images included the Aixplorer ultrasound system (SuperSonic Imagine S.A., Aix-en-Provence, France) at 7–15 MHz and the Resona 5S ultrasound system (Shenzhen Mindray Bio-Medical Electronics Co. Ltd., Shenzhen, China) at 5–14 MHz. All images were stored in DICOM format. Each ultrasound image has a tumor segmentation precisely outlined by an experienced radiologist, which serves as the ground truth. Image sizes range from 721 × 496 to 931 × 606 pixels. All images were resized to 256 × 256 before being fed into the network.
All images were arranged in chronological order of the patients' diagnosis. We used the first 2000 cases for training and the remaining 200 cases for testing, which ensured the independence of the patients in our training and testing datasets.
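A minimal sketch of this preprocessing, assuming pydicom and OpenCV for reading and resizing (the helper names and file listing are illustrative):

```python
import cv2
import pydicom

def load_and_resize(path, size=(256, 256)):
    # Read the DICOM pixel data and resize to the network input size.
    img = pydicom.dcmread(path).pixel_array
    return cv2.resize(img.astype("float32"), size)

def chronological_split(paths_in_diagnosis_order):
    # First 2000 cases for training, remaining 200 for testing, keeping
    # training and testing patients independent.
    return paths_in_diagnosis_order[:2000], paths_in_diagnosis_order[2000:]
```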
For the quantitative evaluation, our work employed the Dice value (%):

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \times 100\%$$

where X and Y represent the same quantities as in (10).
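For binary masks, a direct transcription of this metric could look as follows (a sketch, not the exact evaluation code):

```python
import numpy as np

def dice_percent(x, y):
    # x: predicted binary mask, y: ground-truth binary mask.
    x, y = x.astype(bool), y.astype(bool)
    inter = np.logical_and(x, y).sum()
    denom = x.sum() + y.sum()
    return 100.0 * 2.0 * inter / denom if denom else 100.0
```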
3.2. Implementation Details
Adam [26], with a fixed learning rate, was used as the optimizer in the training stage. The batch size was 32 and the validation ratio was set to 20% (200 cases). The model was trained for 50 epochs and only the weights at the best validation loss were saved. We trained and tested our interactive network using an Intel(R) Xeon(R) Gold 6130 CPU at 2.10 GHz and an NVIDIA TESLA V100 (32 GB).
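A brief sketch of this training policy is given below; the learning rate value is a placeholder, and the model, loss function, and data loaders are assumed to be defined elsewhere.

```python
import torch

def train(model, criterion, train_loader, val_loader, epochs=50, lr=1e-4):
    # lr = 1e-4 is a placeholder; the batch size (32) is set in the loaders.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val < best_val:  # keep only the checkpoint at the best validation loss
            best_val = val
            torch.save(model.state_dict(), "wdtiseg_best.pt")
```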
Our WDTISeg was computationally inexpensive in both the training and testing phases. In the training phase, WDTISeg was trained with different λ values and loss functions, and the average training time was 624.6 s. The model size was 385 MB. In the testing phase, we recorded the time from feeding the input image into the network to attaining the final segmentation after 8 interactions; the average cost was 17.6 s.
3.3. Performance on Automatic Segmentation Task
Our proposed framework WDTISeg could both obtain an automatic segmentation and refine results based on interactions. To demonstrate that our method did not require training an additional automatic segmentation network to obtain the initial segmentation, we compared the automatic segmentation results of U-net and WDTISeg.
Table 1 shows the automatic segmentation results of U-net and WDTISeg. The Dice of the automatic segmentation results of WDTISeg was 82.86 ± 16.22 (%), better than that of U-net. In the automatic segmentation examples in Figure 5, it is clear that the results of WDTISeg were similar to those of U-net, and its segmentations were even slightly more compact. These results prove that WDTISeg retains automatic segmentation performance comparable to U-net after interactive segmentation training. Our study focused on improving refinement based on automatic segmentation, and, for the first time, we proposed that the interactive segmentation network can generate the initial segmentation results by itself, without the need to train an additional automatic segmentation network.
3.4. Impact of the Factor λ in Weighted Distance Transform
To verify the effectiveness of combining the two distance transforms, we compared single-interaction results when λ took different values. Different values represent different weights of the two distance transforms: the weighted distance becomes purely the Euclidean distance when λ = 0 and purely the geodesic distance when λ = 1. The same user interactions were used throughout the experiment.
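One possible realization of this weighted transform, with D = λ·D_geo + (1 − λ)·D_euc following the convention above, is sketched below; the geodesic step cost and the per-map normalization are illustrative choices rather than the exact formulation.

```python
import heapq
import numpy as np
from scipy.ndimage import distance_transform_edt

def euclidean_map(shape, points):
    # Distance from every pixel to the nearest click location.
    seed = np.ones(shape, dtype=np.uint8)
    for r, c in points:
        seed[r, c] = 0
    return distance_transform_edt(seed)

def geodesic_map(image, points, nu=1.0):
    # Dijkstra over the 4-connected pixel grid; each step blends the spatial
    # distance with the intensity difference, so the resulting map follows
    # the image texture (a common discrete geodesic approximation).
    img = image.astype(np.float32)
    h, w = img.shape
    dist = np.full((h, w), np.inf)
    heap = []
    for r, c in points:
        dist[r, c] = 0.0
        heap.append((0.0, r, c))
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + float(np.hypot(1.0, nu * (img[nr, nc] - img[r, c])))
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

def weighted_distance(image, points, lam=0.5):
    # lam = 0 -> purely Euclidean, lam = 1 -> purely geodesic.
    d_euc = euclidean_map(image.shape, points)
    d_geo = geodesic_map(image, points)
    norm = lambda d: d / (d.max() + 1e-8)  # bring both maps to [0, 1]
    return lam * norm(d_geo) + (1.0 - lam) * norm(d_euc)
```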
As can be seen from Table 1, the interactive segmentation method performed much better than the conventional automatic segmentation method U-net, by as much as 10%. Our method with the best-performing value of λ achieved a Dice score of 94.45 ± 3.26% and outperformed the other four values of λ. By fixing the loss function to the proposed compound loss (BCE + Dice + SC), we can see that the results when λ was between 0 and 1 were better than those at both 0 and 1. This proves that combining the two distance transforms performs better than using either alone on an interactive segmentation task.
Figure 6 shows a comparison of the segmentation results of our method with different values of λ given the same user interaction points. The upper case 1 is a tumor with an obscure border, where the location information of the interaction points is more important than the texture information. In this case, the Euclidean distance transform should perform better than the geodesic transform, which is consistent with Figure 6. In the lower case 2, the tumor boundary is obvious, but there is a mis-segmentation outside the tumor. This requires both texture information, to ensure correct segmentation of the tumor region, and interaction point location information, to instruct the network to remove mis-segmented regions outside the tumor. Therefore, λ = 0.5 performed better than any other value in case 2. This proves that combining our two distance transforms is beneficial for dealing with tumors in different cases.
The combination of the Euclidean and geodesic distance transforms can both convey the location information of interaction points and make use of image context information. The experimental results demonstrate that this combination improves the stability of the segmentation model on images that are difficult to segment.
3.5. Effect of Proposed Loss Function
We explored the effect of the proposed loss function by observing the Dice rate on the training and validation datasets, as shown in Figure 7. On the plot of the Dice rate on the training dataset, the Dice loss achieved the best performance, while the compound BCE + Dice + SC loss came second. The reason the Dice loss performed well on the training set is that the network used Dice as the evaluation metric, and the network maximizes Dice by optimizing its parameters during training. However, the Dice rate of the Dice loss on the validation dataset oscillated sharply. This is mainly because the Dice loss is a region-dependent loss: if some pixels of a small target are incorrectly predicted, the loss value changes significantly, which results in a drastic change in the gradient.
Compared with the other three loss functions, the compound loss incorporating prior knowledge achieved the best performance on the validation set and did not show intense oscillations after epoch 30. The BCE loss used for binary classification is insensitive to category imbalance, so it can, to some extent, prevent the oscillation caused by the Dice loss. On the other hand, the loss function based on the compact shape constraint utilizes prior information about tumor shape, and thus can improve the segmentation accuracy. Note that the SC loss converges to 1 at its minimum when the tumor is circular, so it cannot be used as a segmentation loss function alone.
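As an illustration, the sketch below combines BCE, Dice, and a compactness-style shape-constraint (SC) term; the SC form shown is the isoperimetric ratio P²/(4πA), which reaches its minimum of 1 for a circle, consistent with the note above, though the exact formulation and term weights may differ.

```python
import math
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss over the batch.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def shape_constraint_loss(pred, eps=1e-6):
    # Soft perimeter: total magnitude of spatial gradients of the soft mask;
    # soft area: sum of foreground probabilities. The isoperimetric ratio
    # P^2 / (4 * pi * A) is minimized (value 1) by a circular region.
    dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().sum()
    dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().sum()
    perimeter = dx + dy
    area = pred.sum()
    return perimeter ** 2 / (4.0 * math.pi * area + eps)

def compound_loss(logits, target, w_sc=0.1):
    # BCE + Dice + weighted shape constraint; w_sc is an assumed weight.
    pred = torch.sigmoid(logits)  # (N, 1, H, W) soft foreground probabilities
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return bce + dice_loss(pred, target) + w_sc * shape_constraint_loss(pred)
```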
The purpose of introducing a human into the segmentation process is to use the human's prior knowledge as a supplement to improve segmentation accuracy. In the interactive segmentation task, the human is at once a participant in the interactive segmentation process, a provider of prior information, and an evaluator of segmentation results when no ground truth is available. For this task, we may borrow from few-shot learning, which has been widely used in machine learning classification tasks: a human guides segmentation on a few simple images so that the network masters the segmentation skill, further reducing training time and human interaction time.
3.6. Quantitative Comparison of Different Methods
We compared WDTISeg against graph cuts, random walker, and DeepIGeoS (R-net).
Table 1 presents a quantitative comparison of these methods on the testing data. All results for the interactive segmentation methods were taken after 8 interactions. Compared with the other three methods, the Dice of WDTISeg with the best-performing λ reached 94.45 ± 3.26 (%) after 8 interactions, which fully shows that our method can achieve a high segmentation accuracy with fewer interactions.
Visual comparison results are shown in Figure 8. All interactive segmentation methods can attain a highly accurate segmentation after enough interactions. However, the results of graph cuts and random walker showed rougher edges. In contrast, our method obtained segmentations that fit more closely to the tumor margin. Moreover, our WDTISeg only required simple point clicks, while graph cuts and GrabCut required more scribbles or a bounding box. The results of DeepIGeoS are the most similar to ours, because it also uses distance transforms to pass interaction information. However, the segmentation result of our method was smoother at the tumor edge, especially at the lower right corner of case 4. This may benefit from the shape constraint loss we used to impose prior constraints on tumor shape.
4. Conclusions
In this paper, we proposed a one-stage interactive segmentation framework (WDTISeg) for breast ultrasound image segmentation. The ultrasound image was first put into the network to attain an initial segmentation, on which user interaction points were provided to indicate mis-segmentations. The interaction points were converted into distance maps by the weighted distance transform and used as part of the input of the interactive network. The one-stage network with point interaction made the interaction simpler. The loss function designed around clinical prior knowledge of breast cancer further improved the segmentation accuracy. Comparisons with other methods on the test dataset demonstrated the advantages of the proposed method.
However, our method has limitations in how it combines the two distance transforms. In this paper, to verify the usefulness of combining the two distance transforms, experiments were conducted with different ratios, and the results showed that, in general, combining the two methods helps to improve segmentation accuracy. Considering the differences between ultrasound images, the most suitable combination ratio may differ for each image. If an optimal ratio could be obtained adaptively according to the characteristics of the ultrasound image itself, thus attaining the best segmentation result, it would further improve segmentation accuracy and enhance robustness.