1. Introduction
Image datasets with high-quality segmentation annotations play a crucial role in computer vision applications such as autonomous driving [1] and remote sensing [2]. These segmentation annotations can also be leveraged by fundamental downstream tasks, including object detection [3] and object tracking [4]. To obtain high-quality annotations quickly, interactive image segmentation extracts regions of interest (ROI) from an image using user annotations as prior knowledge, reducing the expensive cost of manual labor. Specifically, the popular paradigm of this technique simulates human interactions (e.g., clicks, strokes, and bounding boxes) as response maps and trains networks on the response maps, image data, and mask labels by supervised learning. For a new image, a user simply clicks on the ROI, and an interactive segmentation network iteratively generates pixel-level annotations.
However, existing methods suffer from insufficient utilization of interaction information and an excessively high computational burden during optimization. The inadequate utilization of interaction information stems from insufficient information mining and information decay, which lead to over-segmentation and boundary blur, respectively. The high computational burden is attributed to the fact that existing optimization procedures are nearly equivalent to retraining, resulting in an exorbitant optimization cost. This work addresses these limitations by proposing a method that improves the utilization of interaction information and reduces the computational burden of optimization, making interactive image segmentation more efficient and effective for real-world applications. A detailed analysis follows from three aspects:
Over-segmentation: Over-segmentation occurs when a large part of the background region is mistakenly segmented as foreground, as in Figure 1a. This issue arises when the learning algorithm extracts similar features from the two regions regardless of the actual circumstances. Specifically, homogeneous regions [5] and insufficient feature interpretation [6] are the two primary sources of similar features. Homogeneous regions mean that the background resembles the texture of the target region, while insufficient features mean that the segmentation algorithm is unable to establish distinct features for the background and foreground regions. Fortunately, the interactive information in interactive segmentation tasks clearly indicates, via response maps, whether the pixels of small click regions belong to the background or foreground. We therefore explore semantic information through response maps and pixel values to reduce over-segmentation; if this exploration fails, the segmentation result may not align with human expectations. According to Lin et al. [7], existing methods treat all user interactions equally, disregarding the distinction between the initial interaction and subsequent ones, which increases the number of required interaction cycles; they propose FCA-Net to minimize the need for subsequent clicks. However, the fundamental reason for this issue lies in the fact that existing interactive image segmentation methods mostly rely on fully convolutional networks (FCNs) for feature extraction. The local nature of convolutional operations limits the network's receptive field [8,9], resulting in a lack of global information in the extracted features. In interactive image segmentation specifically, this yields less representative features for each interactive region, leading to large-scale segmentation errors.
Boundary blur: Boundary blur denotes that the segmented boundary between foreground and background is jagged rather than smooth, as shown in Figure 1b. This problem stems from the loss of interaction information, which increases with the depth of the network hierarchy. Although each convolution layer establishes the relationship between interaction information and pixel values in its output feature map, the down-sampling in each layer diminishes the interaction information to a certain degree, and the severity of this issue worsens with each subsequent layer. Several segmentation techniques attempt to resolve blurry segmentation boundaries. Cheng et al. [10] proposed BMask R-CNN, which creates multiple branches to counteract information loss on the main branch. Marin et al. [11] introduced a content-adaptive downsampling technique to reduce the loss of fine-grained details during downsampling. In interactive image segmentation, the weakening of user interaction information and image detail along the network hierarchy can make it impossible to accurately delineate the foreground-background boundary, even with a substantial amount of added interaction information.
Costly optimization: Existing methods [12,13] fine-tune a subset of model parameters after training to improve segmentation of challenging target regions. However, this optimization approach incurs a cost comparable to retraining the model from scratch. Although the trained network can provide accurate segmentation with just a few clicks for images containing objects present in the training set (such as humans or motorcycles), it fails to deliver satisfactory results for unknown objects absent from the training set, even with a large amount of user input, as shown in Figure 1c. For such challenging objects, this paper adopts a semi-supervised optimization method similar to f-BRS [13]. While f-BRS improves speed compared with BRS [12], it still suffers from high time costs. To mitigate this, we introduce zero-order optimization on top of the semi-supervised optimization method, reducing the optimization cost while maintaining segmentation accuracy.
To address the above three issues, the main contributions of this paper are as follows:
A feature augmentation module based on a self-attention mechanism is introduced to enhance the guiding role of interactive information across the entire image.
Combining the boundary optimization module with PointRend enables the network to fully compensate for lost information and extract detailed local information from the input. This addresses the issue of information degradation caused by the network’s hierarchical structure.
A semi-supervised prediction optimization method has been proposed to reduce time costs by leveraging zero-order optimization. This method utilizes interactive clicks as supervisory information to adjust the output values in the model, thereby enhancing the guiding role of interactive information in segmentation and improving the accuracy and efficiency of segmenting challenging target regions.
Due to the high cost of collecting interactive information, the interaction data used during model training is simulated by a computer. However, in real-world applications, user interaction information needs to be obtained through sensors such as mice, touchscreens, and eye trackers. The user interaction data collected through these sensors enriches the model’s understanding of user intent and enhances the results of image processing. Additionally, this sensor-acquired interaction data can be used for subsequent model fine-tuning, leading to improved segmentation performance.
The rest of the article is organized as follows: Section 2 reviews related work and introduces interactive image segmentation methods from the past and present. Section 3 presents the proposed method. Section 4 reports a series of experimental results. Section 5 concludes the paper.
3. Methodology
In this paper, our work comprises a network architecture and an optimization procedure. First, we propose a multi-scale feature-based interactive image segmentation network named EnNet, which adopts a coarse-to-fine structure, as shown in Figure 2. EnNet consists of CoarseNet and FineNet: the input image and interactive clicks are fed into CoarseNet to generate coarse logits, and FineNet then refines the mask. Second, in the inference stage, a semi-supervised prediction optimization algorithm modifies the intermediate output values of EnNet, correcting misclassified pixels in the region near the interactive points.
3.1. Input
Interactive image segmentation, unlike other segmentation tasks, involves user guidance. Therefore, it requires encoding user annotation information and integrating it with the image before inputting it into the network. The following are the click encoding and fusion modules:
Clicks encoding: The interaction information includes positive and negative clicks, where positive clicks indicate the target region and negative clicks indicate the background. Sofiiuk et al. [37] conducted an ablation study comparing distance transform encoding with disk encoding and found disk encoding superior. Therefore, the clicks are encoded as an interaction click map: a disk of radius five pixels encodes each positive and negative click, generating a two-channel click map with the same size as the image.
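For concreteness, the disk encoding described above can be sketched in a few lines of NumPy; the function name and the click-tuple format are illustrative, not taken from the paper's code:

```python
import numpy as np

def encode_clicks(clicks, height, width, radius=5):
    """Encode clicks as a two-channel disk map (channel 0: positive clicks,
    channel 1: negative clicks), matching the image's spatial size."""
    click_map = np.zeros((2, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for row, col, is_positive in clicks:
        # Mark all pixels within `radius` of the click location.
        disk = (ys - row) ** 2 + (xs - col) ** 2 <= radius ** 2
        click_map[0 if is_positive else 1][disk] = 1.0
    return click_map
```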
Fusion module: The input fusion module fuses the image and interaction information to generate the input to the network. The segmentation result from the previous interaction is used as additional interaction information and is concatenated along the channel dimension with the click map to generate a three-channel interaction map. Finally, the image and interaction map are fused by lifting each to the same dimensionality with convolutions and adding them pointwise. The process of input fusion can be seen in Figure 3.
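A minimal PyTorch sketch of this fusion step follows; the kernel sizes and embedding width are our assumptions, since the exact values are not restated here:

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Sketch of the input fusion module: project the image and the
    three-channel interaction map (positive clicks, negative clicks,
    previous mask) to a common width and add them pointwise."""

    def __init__(self, embed_dim=64):  # embed_dim is an assumed width
        super().__init__()
        self.image_proj = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)
        self.interact_proj = nn.Conv2d(3, embed_dim, kernel_size=1)

    def forward(self, image, click_map, prev_mask):
        # Concatenate the 2-channel click map with the previous mask
        # to form the three-channel interaction map.
        interaction = torch.cat([click_map, prev_mask], dim=1)
        return self.image_proj(image) + self.interact_proj(interaction)
```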
3.2. CoarseNet
CoarseNet consists of a basic segmentation network and a global feature enhancement module. We built the basic segmentation network on DeepLabV3+ [38]; it consists of an input fusion module, a ResNet101 backbone [39], an ASPP module, and a decoder.
Fully convolutional networks have fixed receptive fields and translation invariance, which makes it challenging for them to capture globally effective contextual information, and they are less sensitive to specific locations than attention mechanisms. As a result, interactive image segmentation built on fully convolutional networks may struggle to accurately understand image semantics and user interaction information, leading to significant errors in segmenting target regions in practice. In this paper, we introduce a Global Feature Enhancement Module (GFEM) based on the self-attention mechanism to enhance the model's global contextual aggregation, enrich the representative features of interactive regions in each layer, and make the model more accurate in selecting target regions. The GFEM consists of an embedding layer and three Transformer encoder layers, as shown in Figure 4.
The working process of the GFEM is as follows. The feature map $F$ of size $H \times W \times C$ from the backbone network is fed into the embedding layer, where a convolutional operation produces a feature map $F'$ of size $H \times W \times D$. Since the encoder requires sequential data as input, the spatial dimensions of $F'$ are folded into one dimension to obtain a feature sequence $S$ of dimension $HW \times D$. Positional encoding is added to $S$ to preserve spatial position information, as shown in Equation (1):

$$S_0 = \mathrm{Reshape}(\mathrm{Conv}(F)) + E_{pos} \quad (1)$$

Then, $S_{l-1}$ is fed into the encoder to calculate multi-head self-attention and obtain $S'_l$, as shown in Equation (2):

$$S'_l = \mathrm{MSA}(S_{l-1}) + S_{l-1} \quad (2)$$

After layer normalization, $S_l$ is obtained through the feed-forward sublayer, as shown in Equation (3):

$$S_l = \mathrm{MLP}(\mathrm{LN}(S'_l)) + S'_l \quad (3)$$

This step is repeated $L$ times to obtain the output $S_L$ of the last encoder layer. Finally, after layer normalization, $S_L$ is reshaped into the feature map $F_{out}$ of size $H \times W \times D$, as shown in Equation (4):

$$F_{out} = \mathrm{Reshape}(\mathrm{LN}(S_L)) \quad (4)$$

In the above equations, $\mathrm{Conv}$ represents the convolution, $E_{pos}$ the positional encoding, $\mathrm{MSA}$ multi-head self-attention, $\mathrm{MLP}$ the feed-forward layer, $\mathrm{LN}$ layer normalization, and $\mathrm{Reshape}$ the reshaping between spatial and sequential forms.
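To make the data flow concrete, the following PyTorch sketch mirrors Equations (1)-(4). The embedding width, head count, and use of the stock `nn.TransformerEncoder` (which places its norms slightly differently) are our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class GFEM(nn.Module):
    """Global Feature Enhancement Module sketch: an embedding convolution
    followed by L Transformer encoder layers, per Equations (1)-(4)."""

    def __init__(self, in_channels, seq_len, embed_dim=256,
                 num_heads=8, num_layers=3):
        super().__init__()
        self.embed = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, embed_dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, feat):                          # feat: (B, C, H, W)
        x = self.embed(feat)                          # embedding layer
        b, d, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # fold H, W: (B, HW, D)
        seq = seq + self.pos_embed                    # Equation (1)
        seq = self.encoder(seq)                       # Equations (2)-(3), L layers
        seq = self.norm(seq)                          # final layer norm, Equation (4)
        return seq.transpose(1, 2).reshape(b, d, h, w)
```

For example, a 32 × 32 backbone feature map would use `seq_len = 1024`.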
3.3. FineNet
The multiple downsampling operations in CoarseNet weaken the guidance of interaction information and lose image information. Consequently, the segmentation results produced by CoarseNet are inaccurate at the boundaries of the target region. In this paper, we propose FineNet to solve this problem. FineNet consists of an edge refinement module (ERM) based on fine-grained features, which further refines the boundaries of the coarse segmentation result to obtain a more accurate fine mask. As shown in Figure 5, ERM consists of a convolution-based fine-grained feature extraction module and PointRend [40]. The module concatenates the image, click map, and coarse logits along the channel dimension as input, first uses a shallow convolutional network to extract fine-grained features, and then uses PointRend to optimize the boundaries of the coarse logits with the rich detail in those features, thereby obtaining a more accurate boundary segmentation result.
The detailed processing steps of ERM are as follows. The input of ERM includes the image, the positive and negative interactive clicks, and the coarse logits, which helps to restore information lost during extensive downsampling. As larger disks are not suitable for indicating the boundaries of the target region, disks with a radius of two pixels are used to encode positive and negative clicks, generating a two-channel click map with the same size as the image. The original image, click map, and coarse logits are concatenated along the channel dimension to generate a six-channel input information map, as shown in Equation (5), where $I$ represents the original image, $M$ represents the click map, and $Y_c$ represents the coarse segmentation result:

$$X = \mathrm{Concat}(I, M, Y_c) \quad (5)$$

As shallow networks have smaller receptive fields and smaller overlapping regions, they allow the network to capture more detail. The ERM uses a shallow convolutional network to extract fine-grained features, as shown in Equation (6): the input information $X$ is processed by five convolutional blocks to obtain fine-grained features $F_f$:

$$F_f = \mathrm{ConvBlocks}(X) \quad (6)$$
PointRend is an upsampling method for image segmentation, designed to enhance the segmentation performance of object boundaries by leveraging detailed features.
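A minimal sketch of the fine-grained extraction stage (Equations (5) and (6)) might look as follows; the channel width and 3 × 3 kernels are assumptions, and PointRend itself is omitted:

```python
import torch
import torch.nn as nn

class FineGrainedExtractor(nn.Module):
    """ERM front-end sketch: the image (3 ch), click map (2 ch), and coarse
    logits (1 ch) are concatenated into a 6-channel input and passed through
    five shallow convolutional blocks."""

    def __init__(self, channels=64):
        super().__init__()
        blocks, in_ch = [], 6
        for _ in range(5):  # five convolutional blocks, per Equation (6)
            blocks += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
            in_ch = channels
        self.blocks = nn.Sequential(*blocks)

    def forward(self, image, click_map, coarse_logits):
        x = torch.cat([image, click_map, coarse_logits], dim=1)  # Equation (5)
        return self.blocks(x)                                    # Equation (6)
```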
3.4. Loss Function
EnNet consists of CoarseNet and FineNet. CoarseNet focuses on the overall target region, while FineNet focuses on the boundaries of the target region. Given these differences, the two parts use different loss functions.
CoarseNet uses the Normalized Focal Loss (NFL), as shown in Equation (7):

$$L_{NFL} = -\frac{1}{\sum_{j}(1 - p_j)^{\gamma}} \sum_{i}(1 - p_i)^{\gamma}\log(p_i) \quad (7)$$

$\gamma$ is an adjustable factor greater than 0, and $p_i$ represents the predicted confidence of pixel $i$, defined as shown in Equation (8):

$$p_i = \begin{cases}\hat{y}_i, & y_i = 1\\ 1 - \hat{y}_i, & y_i = 0\end{cases} \quad (8)$$

$\hat{y}_i$ is the predicted value of pixel $i$, and $y_i$ is its ground-truth label, with $y_i = 1$ indicating the target region and $y_i = 0$ indicating the background. $p_i$ reflects the difficulty of pixel classification: a higher $p_i$ indicates higher confidence and easier classification, while a lower $p_i$ indicates lower confidence and more difficult classification. NFL therefore increases the weight of hard-to-segment samples in the loss, which helps to improve accuracy on hard samples.
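A short PyTorch sketch of this loss, under the reconstruction in Equations (7) and (8), is given below; the default $\gamma$ and the detached normalizer are assumptions:

```python
import torch

def normalized_focal_loss(logits, targets, gamma=2.0, eps=1e-8):
    """Sketch of NFL (Equations (7)-(8)) for binary segmentation.

    logits: raw predictions, shape (B, 1, H, W); targets: binary labels
    of the same shape. gamma is the adjustable focusing factor (> 0).
    """
    prob = torch.sigmoid(logits)
    # Equation (8): confidence assigned to the true class of each pixel.
    p = torch.where(targets > 0.5, prob, 1.0 - prob)
    weight = (1.0 - p) ** gamma
    # Normalize by the total focal weight (treated as a constant) so the
    # loss magnitude stays comparable across images and click patterns.
    return -(weight * torch.log(p + eps)).sum() / (weight.sum().detach() + eps)
```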
FineNet is designed to improve the segmentation of object boundaries, so its loss function must focus more on the pixels on the object boundaries. To this end, ERM adopts a normalized focal loss with boundary weighting, in which boundary pixels are assigned higher weights, as shown in Equation (9):

$$L_{BNFL} = -\frac{1}{\sum_{j} w_j (1 - p_j)^{\gamma}} \sum_{i} w_i (1 - p_i)^{\gamma}\log(p_i), \qquad w_i = \begin{cases}\beta, & b_i = 1\\ 1, & b_i = 0\end{cases} \quad (9)$$

Specifically, $b_i = 1$ indicates that pixel $i$ is on an object boundary, while $b_i = 0$ indicates any other pixel. The weight $\beta$ of the loss contributed by boundary pixels is set to 1.5.
The Point Head in the PointRend module of ERM is supervised by binary cross-entropy, as shown in Equation (10), where $P_s$ represents the set of points selected by PointRend for optimization:

$$L_{point} = -\frac{1}{|P_s|}\sum_{i \in P_s}\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] \quad (10)$$
The final loss function combines the three terms above, as formulated in Equation (11):

$$L = L_{NFL} + L_{BNFL} + L_{point} \quad (11)$$
3.5. Semi-Supervised Prediction Optimization Method
Interactive clicks represent the user's intentions, so the segmentation results near the click locations should match the click information. However, in practical scenarios where the target region is similar to the background, partially occluded, texturally complex, or unknown, the segmentation results near the interactive clicks may be incorrect and may not align with the click information. For such challenging scenarios, after model training is completed, we use user annotation data to fine-tune specific model parameters, thereby enhancing the instructive nature of user interaction information. However, this approach incurs a time cost comparable to training the entire model, so we employ zero-order optimization to reduce the time expense.
In this paper, we formulate this problem as an optimization problem and propose a semi-supervised prediction optimization method (POM). POM uses the user-provided interactive clicks as supervision to adjust the output of the GFEM module in EnNet during the inference stage, ensuring that the pixels in the local area near each interactive click are correctly segmented in the final result, as shown in Figure 6. In detail, let $z$ be the output of the GFEM module, $g$ be the processing performed by the latter part of the network, and $\hat{y}$ be the final segmentation result, as represented by Equation (12):

$$\hat{y} = g(z) \quad (12)$$

The algorithm takes $z$ as the optimization variable. During inference, when the predicted result at an interactive point does not match the interaction information, the algorithm fine-tunes $z$ to change the output of the network so that the prediction at the interactive point matches the interaction. Correcting the final segmentation result by adjusting the output $z$ of the GFEM module can be formulated as the optimization problem in Equation (13), which seeks the value of $z$ that minimizes the loss function $L(z)$:

$$z^{*} = \arg\min_{z} L(z) \quad (13)$$
The optimization loss function $L(z)$ measures the error between the model's predictions $g(z)$ and the true labels $y$. In the input fusion module of EnNet, the interaction points are encoded as circular disks of radius five pixels centered at each point, and $L(z)$ only calculates the prediction error over these disks to improve the segmentation accuracy in the local area near the interaction points. Let $P$ be the set of interaction points, and let $P_f$ and $P_b$ be the subsets labeled as foreground and background, respectively. For an interaction point $i$ in $P$, the model's prediction at $i$ is denoted $g(z)_i$; the label values at foreground and background interaction points are 1 and 0. The prediction losses for the foreground and background interaction points are shown in Equations (14) and (15), respectively:

$$L_f(z) = \sum_{i \in P_f}\left(g(z)_i - 1\right)^2 \quad (14)$$

$$L_b(z) = \sum_{i \in P_b}\left(g(z)_i\right)^2 \quad (15)$$
To prevent large variations of $z$ during the algorithm's execution, a second-order norm is introduced as a constraint term in the loss function, with the weight coefficient $\lambda$ fixed at 0.001. The resulting optimization loss function is shown in Equation (16), where $z_0$ represents the initial value of $z$:

$$L(z) = L_f(z) + L_b(z) + \lambda\left\|z - z_0\right\|_2^{2} \quad (16)$$
We minimize the prediction loss function $L(z)$ and obtain the optimal $z^{*}$ with the L-BFGS algorithm. Each L-BFGS iteration requires the gradient of $L(z)$ with respect to $z$, computed by back-propagation through the latter part of the network; this involves repeated derivative computations via the chain rule and can be time-consuming. To reduce this cost, the proposed prediction optimization method can use zero-order gradient estimation [41] to approximate the gradient of $L(z)$ with respect to $z$ without sacrificing too much accuracy. Zero-order gradient estimation imitates first-order methods by using finite differences: it estimates the gradient from the objective values at two very close points, $L(z)$ and $L(z + hv)$, separated by a small constant $h$ along a direction vector $v$. The $i$-th component of the gradient of $L(z)$ can thus be approximated as shown in Equation (17), where $h$ is a small constant and $e_i$ is the standard basis vector whose $i$-th component alone equals 1:

$$\frac{\partial L(z)}{\partial z_i} \approx \frac{L(z + h e_i) - L(z)}{h} \quad (17)$$
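The following PyTorch sketch expresses Equations (12)-(17) under our notation; the helper names, the index-based representation of the click disks, and the step size $h$ are assumptions, not the paper's implementation:

```python
import torch

def pom_loss(g, z, z0, fg_idx, bg_idx, lam=1e-3):
    """Optimization loss L(z) of Equation (16): squared prediction error at
    the foreground/background click disks plus an L2 term keeping z near its
    initial value z0. g is the latter part of the network (Equation (12));
    fg_idx/bg_idx index the pixels inside the click disks."""
    pred = g(z)
    loss_f = ((pred[fg_idx] - 1.0) ** 2).sum()      # Equation (14)
    loss_b = (pred[bg_idx] ** 2).sum()              # Equation (15)
    return loss_f + loss_b + lam * ((z - z0) ** 2).sum()

def zero_order_grad(loss_fn, z, h=1e-4):
    """Coordinate-wise finite-difference gradient estimate (Equation (17)).
    Illustrative only: perturbing every coordinate is impractical for a large
    z, so in practice only a subset of coordinates or random directions
    would be probed."""
    with torch.no_grad():
        base = loss_fn(z)
        grad = torch.zeros_like(z)
        flat, gflat = z.view(-1), grad.view(-1)
        for i in range(flat.numel()):
            old = flat[i].item()
            flat[i] = old + h                       # z + h * e_i
            gflat[i] = (loss_fn(z) - base) / h
            flat[i] = old                           # restore z
    return grad
```

Under the schedule chosen later in Section 4.4, such an estimator would supply the gradients for the first few L-BFGS iterations, with autograd back-propagation taking over afterwards.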
The prediction optimization method employs L-BFGS over multiple iterations to obtain the optimal solution, and the gradient accuracy required increases as the iterations progress. Although zero-order gradient estimation introduces some error, it can still approximate the gradients within an acceptable error range, thereby reducing the time cost of the prediction optimization method.
4. Experiments and Results
In this section, the efficacy of GFEM, ERM, and POM is validated through ablation experiments, and the superiority of the proposed method is demonstrated through comparisons with other approaches. Additionally, the datasets used, evaluation metrics, and experimental details are presented.
4.1. Datasets
We trained the proposed model on the COCO + LVIS dataset, a combination of COCO [42] and LVIS [43]. Specifically, this dataset includes a subset of images and corresponding annotations from the LVIS dataset, supplemented with annotations from the COCO dataset, resulting in 104,000 images and 1.6 million instance-level annotations.
We evaluate the proposed method on the GrabCut [15], Berkeley [44], DAVIS [45], and SBD [46] datasets. The GrabCut dataset contains 50 images with 50 instance-level labels. The Berkeley dataset includes 200 training images and 100 testing images, of which 96 testing images carry 100 instance-level labels. The DAVIS dataset, usually used for video object segmentation, includes 50 high-quality annotated videos, from which we extract 10% of the annotated frames as the testing set for interactive image segmentation, yielding 345 images and 345 instance-level labels. The SBD dataset consists of 11,355 images, including 8498 training images and 2857 validation images; the validation set has 6671 instance-level labels and is commonly used as the testing set for interactive image segmentation.
4.2. Evaluation Metrics
Mean intersection over union (mIoU), the standard number of clicks (NoC), and the number of failures (NoF) are used as evaluation metrics. NoC is the number of interactive clicks required to reach a target IoU threshold, typically set to 85% or 90% and denoted NoC@85 and NoC@90, respectively. NoF is the number of test images that cannot reach the specified IoU threshold (usually 90%, denoted NoF@90) even after a fixed budget of interactive clicks.
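As a small sketch of how these metrics aggregate, assuming the common 20-click evaluation budget (the function names are illustrative):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def noc_nof(clicks_needed, max_clicks=20):
    """Aggregate per-image results into NoC and NoF.

    clicks_needed: for each test image, the number of simulated clicks
    required to reach the target IoU (capped at max_clicks on failure).
    NoC is the mean count; NoF counts the images that failed.
    """
    clicks = np.asarray(clicks_needed, dtype=float)
    return clicks.mean(), int((clicks >= max_clicks).sum())
```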
4.3. Implementation Details
In this paper, we employ the interactive point simulation sampling strategy proposed in RITM to generate interaction points, with a maximum of 20 points comprising 1–10 positive samples and 0–10 negative samples. The COCO + LVIS dataset is used as the training set, with random scaling in the range [0.75, 1.4] and random cropping to a size of [320, 480]. The training images are augmented by horizontal flipping, vertical flipping, edge padding, brightness jittering, contrast jittering, and color jittering. The backbone network is initialized with weights pretrained on ImageNet, while the remaining network structures are initialized using the Kaiming scheme. The optimizer is Adam, with $\beta_1$ set to 0.9 and $\beta_2$ set to 0.999. A total of 230 epochs are trained with a batch size of 64, with the learning rate decayed by a factor of 0.1 at the 190th and 210th epochs. All experiments are conducted on a PC with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10 GHz (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA) running Ubuntu 20.04.1 LTS.
4.4. Ablation Study
In this paper, we conduct an ablation study on the proposed components, using the basic segmentation network (BSM) as the baseline, and evaluate the effectiveness of each component.
GFEM: As shown in Table 1, adding GFEM to the baseline model improves NoC@85, NoC@90, and NoF@90, indicating that GFEM enhances the accuracy and convergence of the model. To assess whether GFEM improves the understanding of user interaction information and image semantics, we present feature heatmaps of the model outputs with and without GFEM in Figure 7. The outputs with GFEM delineate the contours of the target more accurately rather than providing a rough estimate, indicating that GFEM improves the understanding of user interaction information and image semantics and leads to better segmentation results. The predicted masks of baseline + GFEM are shown in Figure 8. GFEM applies self-attention to weight the correlations between different regions, allowing the model to focus on information related to the target region, ignore unrelated noise, extract more accurate feature representations, and improve overall effectiveness.
ERM: As shown in Table 1, adding ERM to both the baseline and baseline + GFEM models improves NoC@85, NoC@90, and NoF@90. Notably, baseline + GFEM already achieves a rough segmentation of the target region, and the addition of ERM further enhances the results, indicating that ERM effectively utilizes fine-grained features to optimize the coarse segmentation output. This improves the accuracy and convergence of the model and reduces the number of required interactions. The predicted masks of baseline + ERM are shown in Figure 9. The best results are achieved with baseline + GFEM + ERM, which demonstrates the effectiveness of the EnNet architecture.
POM: We investigated the relationship between the number of iterations that use zero-order gradient estimation and the segmentation accuracy, with the iteration history size $m$ of the L-BFGS algorithm set to 20, a fixed acceptable error tolerance, and the maximum number of iterations set to 10. As shown in Figure 10, using zero-order gradient estimation in the early iterations of the prediction optimization method slightly decreases segmentation accuracy, but when the number of such iterations exceeds a certain threshold, the accuracy drops significantly. Since the prediction optimization method must compute gradients in every iteration, the number of iterations using zero-order gradient estimation ranges from 0 to 10. Based on the results on several test sets, zero-order gradient estimation is used to calculate the gradients in the first four iterations, and back-propagation is used in the subsequent iterations.
After determining the number of iterations that use zero-order gradient estimation, we applied the prediction optimization method to EnNet and validated its effectiveness by comparing performance before and after. As shown in Table 2, Table 3 and Table 4, applying POM to EnNet improves NoC@85, NoC@90, and NoF@90, indicating that POM improves the model's accuracy and convergence. The predicted masks of EnNet + POM are shown in Figure 11.
4.5. Comparison with Other Works
We evaluate the proposed method on several commonly used interactive image segmentation datasets: GrabCut, Berkeley, DAVIS, and SBD. As shown in Table 5, Table 6 and Table 7, our method exhibits superior accuracy and convergence compared with other methods. Specifically, on DAVIS and SBD, our method achieves the best NoC@85; on Berkeley, DAVIS, and SBD, it outperforms other interactive image segmentation methods in terms of NoC@90; and on Berkeley and SBD, it achieves the best NoF@90. On DAVIS, our method's NoF@90 is second only to RITM. Figure 12 plots the mean intersection over union (IoU) against the number of interactions on the GrabCut, Berkeley, DAVIS, and SBD datasets for our method and others. Compared with other methods, ours achieves higher accuracy with the same number of interactions, and its IoU curve fluctuates less.