CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation
Abstract
1. Introduction
- Our work combines CLIP with a prototype-based few-shot semantic segmentation model. This approach addresses the over-fitting that arises when CLIP is fine-tuned with only a few support images.
- We propose the multi-modal support prototype (MSP), which brings text samples into model training and introduces fused image–text features into the prototype generation process. Compared with single-modal prototype features, multi-modal support prototypes better capture the shared semantics of an image and its text description for an object class.
- We propose the adaptive foreground background matching (AFBM) module, which combines an image's foreground and background information with query image features and the MSP to generate an adaptive query prototype (a minimal sketch of the prototype pipeline follows this list). Experiments demonstrate the strong performance of our method on diverse datasets.
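To make the prototype idea concrete, the following minimal PyTorch sketch illustrates the general flow that prototype-based few-shot segmentation methods share: a visual support prototype is obtained by masked average pooling, fused with a projected CLIP text embedding, and matched against query features with cosine similarity. It is an illustration, not the authors' implementation; all function names and the fusion weight `alpha` are assumptions, and the text embedding is assumed to be already projected to the visual feature dimension.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Average the features in `feat` (B, C, H, W) over pixels where
    `mask` (B, 1, h, w) is 1, producing one prototype per image (B, C)."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

def multimodal_support_prototype(support_feat, support_mask, text_emb, alpha=0.5):
    """Fuse the visual support prototype with a CLIP-style text embedding.
    `text_emb` (B, C) is assumed already projected to the visual feature
    dimension; `alpha` is a hypothetical fusion weight, not from the paper."""
    visual_proto = masked_average_pooling(support_feat, support_mask)         # (B, C)
    text_proto = F.normalize(text_emb, dim=-1) * visual_proto.norm(dim=-1, keepdim=True)
    return alpha * visual_proto + (1.0 - alpha) * text_proto                  # (B, C)

def query_similarity_map(query_feat, prototype):
    """Cosine similarity between every query-pixel feature and the prototype,
    the usual non-parametric matching step in prototype-based FSS."""
    proto = prototype[:, :, None, None].expand_as(query_feat)                 # (B, C, H, W)
    return F.cosine_similarity(query_feat, proto, dim=1)                      # (B, H, W)

if __name__ == "__main__":
    B, C, H, W = 2, 512, 60, 60
    msp = multimodal_support_prototype(
        torch.randn(B, C, H, W),            # support features
        torch.rand(B, 1, H, W).round(),     # binary support mask
        torch.randn(B, C),                  # projected text embedding
    )
    sim = query_similarity_map(torch.randn(B, C, H, W), msp)
    print(sim.shape)                        # torch.Size([2, 60, 60])
```

The similarity map produced this way is what a matching module such as AFBM would further refine with foreground and background information before the final mask is predicted.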
2. Related Work
2.1. Semantic Segmentation
2.2. Few-Shot Learning
2.3. Few-Shot Semantic Segmentation
2.4. CLIP in Segmentation
3. Method
3.1. Task Description
3.2. Image-Text Feature Fusion Processing
3.3. Multi-Modal Support Prototype Generator
3.4. Adaptive Foreground Background Matching Module
3.5. Multi-Prototype Matching Loss Function
4. Experiments
4.1. Datasets and Implementation Details
- Datasets. We conduct experiments on two benchmark datasets, PASCAL-5i [62] and COCO-20i [2]. The PASCAL dataset has long served as a benchmark for evaluating image segmentation methods; it contains images of 20 object classes annotated at the pixel level, meaning every pixel is labeled with the object class it belongs to. Following previous work, we divide the 20 PASCAL categories into four folds of five categories each, train on three folds, and run inference on the remaining fold, so that the training and test sets never intersect, as the FSS task requires. To ensure the validity of the experiment, we use fold0 for inference while the remaining three folds are used for training, then use fold1 for inference while the other three folds are used for training, and so on. We repeat the experiment four times and report the performance of each fold separately (a minimal sketch of the split follows). The COCO dataset contains over 330,000 images spanning 80 object categories commonly found in complex real-world scenes. Compared to PASCAL, COCO is a significantly more challenging benchmark with far greater category and scene complexity, yet our method still achieves good performance on it. We likewise follow the setup of previous work, dividing the 80 COCO classes into four folds and reporting the score on each fold separately.
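To make the fold protocol concrete, here is a minimal Python sketch (an illustration, not the released code) that splits the class set into four folds and holds one out for inference; the helper names are ours.

```python
# Illustrative fold split: PASCAL-5i groups its 20 classes into 4 folds of 5,
# COCO-20i groups its 80 classes into 4 folds of 20.

def make_folds(num_classes, num_folds=4):
    """Split class indices 0..num_classes-1 into contiguous, equally sized folds."""
    per_fold = num_classes // num_folds
    return [list(range(i * per_fold, (i + 1) * per_fold)) for i in range(num_folds)]

def split_for_fold(num_classes, test_fold, num_folds=4):
    """Return (train_classes, test_classes) for one cross-validation run;
    the two sets never intersect, as the FSS protocol requires."""
    folds = make_folds(num_classes, num_folds)
    test_classes = folds[test_fold]
    train_classes = [c for i, fold in enumerate(folds) if i != test_fold for c in fold]
    return train_classes, test_classes

train_cls, test_cls = split_for_fold(20, test_fold=0)   # PASCAL-5i, hold out fold0
print(len(train_cls), test_cls)                          # 15 [0, 1, 2, 3, 4]
```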
- Implementation Details. We used the classical ResNet-50/101 [4] as the backbone network, initialized with parameters pretrained on ImageNet [1]. On the text side of CLIP, we use ViT-B/32 as the text encoder. The original images and ground truth masks were cropped to 473 × 473. We optimized the model parameters with stochastic gradient descent, using a momentum of 0.9 and an initial learning rate of 0.001, and trained the model with meta-learning. As described in Section 3.1, the model is trained for 24,000 episodes, each containing one support–query pair; a training round consists of 1200 episodes, giving 20 rounds in total, and each batch contains four support–query pairs. For testing, we randomly selected 1000/4000 support–query pairs, and the ground truth masks of the query images were not visible during testing. Consistent with most previous work, we report performance on both datasets using mean Intersection-over-Union (mIoU); the formula for mIoU is given in Equation (18).
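Equation (18) itself is not reproduced in this excerpt; for reference, the standard per-fold mIoU definition it refers to can be written as

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

where C is the number of classes evaluated in the current fold and TP_c, FP_c, and FN_c are the true-positive, false-positive, and false-negative pixel counts for class c accumulated over the test episodes.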
4.2. Comparison with Previous Works
- PASCAL-5i. To verify the effectiveness of our proposed method, we compared our model with different approaches on the PASCAL and COCO datasets. As shown in Table 1, our model compares favorably with previous approaches in both the one-shot and five-shot settings. In the one-shot setting with a ResNet-50 feature encoder, our method exceeds SSP [51] by 2.0% on average across the four folds, which demonstrates the effectiveness of the MSP and AFBM modules. Although our one-shot results are 1.1% lower than HSNet, we achieve higher performance than HSNet in the five-shot setting. We attribute this gap to HSNet's encoder–decoder architecture, which requires far longer training than our method: as reported in Table 2, HSNet takes 54 h to train, whereas our method needs only 5 h in the same experimental setting. Under the five-shot setting with ResNet-50, we raise the fold1 and fold2 scores to 73.0% and 75.1%, respectively, which is significantly ahead of previous work. With the stronger ResNet-101 backbone, we achieve even higher scores: 67.8% on fold0 and a 65.9% average across the four folds in the one-shot setting, and, in the five-shot setting, 72.8% on fold0 and a 74.5% average across the four folds, which is 4.1% higher than HSNet [55]. Because prototype-based few-shot segmentation models compute segmentation results with non-parametric measures such as similarity functions, training and inference are fast; although we use ViT-B/32 as the text feature encoder, it does not significantly increase training or inference time.
Table 1. Quantitative comparison results on the PASCAL-5i dataset. The best and second-best results are highlighted with bold and underline, respectively.
Method | Backbone | 1-shot fold0 | fold1 | fold2 | fold3 | Mean | 5-shot fold0 | fold1 | fold2 | fold3 | Mean
---|---|---|---|---|---|---|---|---|---|---|---
PANet [8] | Res-50 | 44.0 | 57.5 | 50.8 | 44.0 | 49.1 | 55.3 | 67.2 | 61.3 | 53.2 | 59.3
PPNet [9] | Res-50 | 48.6 | 60.6 | 55.7 | 46.5 | 52.8 | 58.9 | 68.3 | 66.8 | 58.0 | 63.0
PFENet [53] | Res-50 | 61.7 | 69.5 | 55.4 | 56.3 | 60.8 | 63.1 | 70.7 | 55.8 | 57.9 | 61.9
CWT [63] | Res-50 | 56.3 | 62.0 | 59.9 | 47.2 | 56.4 | 61.3 | 68.5 | 68.5 | 56.6 | 63.7
HSNet [55] | Res-50 | 64.3 | 70.7 | 60.3 | 60.5 | 64.0 | 70.3 | 73.2 | 67.4 | 67.1 | 69.5
MLC [64] | Res-50 | 59.2 | 71.2 | 65.6 | 52.5 | 62.1 | 63.5 | 71.6 | 71.2 | 58.1 | 66.1
SSP [51] | Res-50 | 61.4 | 67.2 | 65.4 | 49.7 | 60.9 | 68.0 | 72.0 | 74.8 | 60.2 | 68.8
Ours | Res-50 | 63.5 | 67.8 | 67.9 | 52.2 | 62.9 | 69.2 | 73.0 | 75.1 | 61.4 | 69.7
FWB [7] | Res-101 | 51.3 | 64.5 | 56.7 | 52.2 | 56.2 | 54.8 | 67.4 | 62.2 | 55.3 | 59.9
PPNet [9] | Res-101 | 52.7 | 62.8 | 57.4 | 47.7 | 55.2 | 60.3 | 70.0 | 69.4 | 60.7 | 65.1
PFENet [53] | Res-101 | 60.5 | 69.4 | 54.4 | 55.9 | 60.1 | 62.8 | 70.4 | 54.9 | 57.6 | 61.4
CWT [63] | Res-101 | 56.9 | 65.2 | 61.2 | 48.8 | 58.0 | 62.6 | 70.2 | 68.8 | 57.2 | 64.7
HSNet [55] | Res-101 | 67.3 | 72.3 | 62.0 | 63.1 | 66.2 | 71.8 | 74.4 | 67.0 | 68.3 | 70.4
MLC [64] | Res-101 | 60.8 | 71.3 | 61.5 | 56.9 | 62.6 | 65.8 | 74.9 | 71.4 | 63.1 | 68.8
SSP [51] | Res-101 | 63.7 | 70.1 | 66.7 | 55.4 | 64.0 | 70.3 | 76.3 | 77.8 | 65.5 | 72.5
Ours | Res-101 | 67.8 | 71.2 | 67.7 | 57.1 | 65.9 | 72.8 | 76.7 | 81.7 | 66.7 | 74.5

- COCO-20i. This is a very challenging dataset, with 80 categories and more complex foreground–background relationships, but our proposed method still achieves better results than previous work. As shown in Table 3, under the one-shot setting with ResNet-50, our model's average score across the four folds is 1.4% higher than SSP [51] and 1.1% higher than MLC [64]. In the five-shot setting, we achieve 56.5% on fold0, which is better than most previous approaches. With the stronger ResNet-101 backbone, our model performs even better on this complex dataset: in the one-shot setting it outperforms SSP [51] by 1.9% on average across the four folds, and in the five-shot setting by 3.2% on average.
4.3. Efficiency Comparison with Previous Works
4.4. Ablation Studies
- Ablation experiments for different modules. To assess the effectiveness of our methods, we conducted ablation studies on the proposed MSP, AFBM, and MML components, using the five-shot setting and ResNet-50 as the backbone network. As shown in Table 4, MSP alone improves the model's performance by 1.6% over the baseline, evidence that the proposed multi-modal support prototype effectively improves the model's predictive capability and that the introduced textual features strengthen the support prototype's ability to recognize a novel class. AFBM alone improves the average performance by 3.3% over the baseline. Combining MSP with AFBM yields a substantial improvement, raising performance to 68.7%, which is 5.6% higher than the baseline. Finally, incorporating all components, including the MML loss function, raises the score to 69.7%, 6.6% above the baseline. These results demonstrate the effectiveness of our proposed method.
- Ablation experiments for the foreground and background thresholds. We use foreground and background thresholds to generate the predicted mask (as in Equation (9)). The choice of these thresholds can significantly affect performance: the threshold values determine which pixels are categorized as foreground or background, which in turn affects the accuracy and level of detail of the resulting segmentation. If the foreground threshold is set too high, some foreground pixels are likely to be assigned incorrectly to the background; conversely, if it is set too low, some background pixels are likely to be assigned incorrectly to the foreground. We conduct ablation experiments over combinations of the two threshold values (a simplified sketch of the procedure follows), and the results are shown in Figure 5, where lighter colors represent better results and darker colors worse results. Figure 5 summarizes the model's prediction scores under different foreground and background thresholds, and the best-performing threshold pair can be read off the figure.
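As an illustration of how such thresholds act on prototype-similarity maps, the sketch below is a simplified stand-in for Equation (9), not the paper's exact rule: the threshold names `t_fg`/`t_bg`, their values, and the tie-breaking choice are all hypothetical. It binarizes foreground and background similarity maps and sweeps a grid of threshold pairs in the spirit of Figure 5.

```python
import torch

def predict_mask(fg_sim, bg_sim, t_fg=0.7, t_bg=0.6):
    """Illustrative thresholding of similarity maps into a binary mask.
    fg_sim, bg_sim: (H, W) similarities to the foreground / background
    prototypes, assumed rescaled to [0, 1]."""
    foreground = fg_sim >= t_fg        # pixels confidently matching the foreground prototype
    background = bg_sim >= t_bg        # pixels confidently matching the background prototype
    mask = foreground & ~background    # ties are resolved in favor of background here
    return mask.float()

# Sweep a grid of threshold pairs, mirroring the ablation shown in Figure 5.
fg_sim, bg_sim = torch.rand(60, 60), torch.rand(60, 60)
for t_fg in (0.5, 0.6, 0.7, 0.8):
    for t_bg in (0.5, 0.6, 0.7, 0.8):
        m = predict_mask(fg_sim, bg_sim, t_fg, t_bg)
        # In the real ablation, each (t_fg, t_bg) cell would be scored with mIoU.
        print(t_fg, t_bg, m.mean().item())
```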
4.5. Qualitative Visualization Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference of the Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; Proceedings—Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Siam, M.; Oreshkin, B.N.; Jagersand, M. Amp: Adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5249–5258. [Google Scholar]
- Liu, L.; Cao, J.; Liu, M.; Guo, Y.; Chen, Q.; Tan, M. Dynamic extension nets for few-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1441–1449. [Google Scholar]
- Nguyen, K.; Todorovic, S. Feature weighting and boosting for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 622–631. [Google Scholar]
- Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
- Liu, Y.; Zhang, X.; Zhang, S.; He, X. Part-aware prototype network for few-shot semantic segmentation. In Proceedings of the 16th European Conference of the Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Proceedings—Part IX 16. Springer: Cham, Switzerland, 2020; pp. 142–158. [Google Scholar]
- Lin, Z.; Yu, S.; Kuang, Z.; Pathak, D.; Ramanan, D. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19325–19337. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Wang, W.; Bao, H.; Dong, L.; Bjorck, J.; Peng, Z.; Liu, Q.; Aggarwal, K.; Mohammed, O.K.; Singhal, S.; Som, S.; et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv 2022, arXiv:2208.10442. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. arXiv 2021, arXiv:2110.04544. [Google Scholar] [CrossRef]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
- Zhang, R.; Fang, R.; Zhang, W.; Gao, P.; Li, K.; Dai, J.; Qiao, Y.; Li, H. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv 2021, arXiv:2111.03930. [Google Scholar]
- Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven Semantic Segmentation. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021. [Google Scholar]
- Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18082–18091. [Google Scholar]
- Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825. [Google Scholar]
- Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19113–19122. [Google Scholar]
- Liu, W.; Zhang, C.; Lin, G.; Liu, F. Crnet: Cross-reference networks for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4165–4173. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Rother, C.; Kolmogorov, V.; Blake, A. “GrabCut” interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (TOG) 2004, 23, 309–314. [Google Scholar] [CrossRef]
- Roerdink, J.B.; Meijster, A. The watershed transform: Definitions, algorithms and parallelization strategies. Fundam. Inform. 2000, 41, 187–228. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Proceedings—Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 12077–12090. [Google Scholar]
- Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A Closer Look at Few-shot Classification. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Gidaris, S.; Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4367–4375. [Google Scholar]
- Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A baseline for few-shot image classification. arXiv 2019, arXiv:1909.02729. [Google Scholar]
- Lake, B.; Lee, C.y.; Glass, J.; Tenenbaum, J. One-shot learning of generative speech concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Quebec City, QC, Canada, 23–26 July 2014; Volume 36. [Google Scholar]
- Hariharan, B.; Girshick, R. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3018–3027. [Google Scholar]
- Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
- Schwartz, E.; Karlinsky, L.; Shtok, J.; Harary, S.; Marder, M.; Kumar, A.; Feris, R.; Giryes, R.; Bronstein, A. Δ-encoder: An effective sample synthesis method for few-shot object recognition. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Allen, K.; Shelhamer, E.; Shin, H.; Tenenbaum, J. Infinite mixture prototypes for few-shot learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 232–241. [Google Scholar]
- Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; Volume 2. [Google Scholar]
- Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; Luo, J. Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 9–15 June 2019; pp. 7260–7268. [Google Scholar]
- Shaban, A.; Bansal, S.; Liu, Z.; Essa, I.; Boots, B. One-shot learning for semantic segmentation. arXiv 2017, arXiv:1709.03410. [Google Scholar]
- Dong, N.; Xing, E.P. Few-shot semantic segmentation with prototype learning. In Proceedings of the 2018 British Machine Vision Conference (BMVC 2018), Newcastle, UK, 3–6 September 2018; Volume 3. [Google Scholar]
- Zhang, X.; Wei, Y.; Yang, Y.; Huang, T.S. Sg-one: Similarity guidance network for one-shot semantic segmentation. IEEE Trans. Cybern. 2020, 50, 3855–3865. [Google Scholar] [CrossRef] [PubMed]
- Fan, Q.; Pei, W.; Tai, Y.W.; Tang, C.K. Self-support few-shot semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–24 October 2022; pp. 701–719. [Google Scholar]
- Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5217–5226. [Google Scholar]
- Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Q.; Liu, B.; Lyu, S.; Chen, H. A self-distillation embedded supervised affinity attention model for few-shot segmentation. IEEE Trans. Cogn. Dev. Syst. 2023. [Google Scholar] [CrossRef]
- Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6941–6952. [Google Scholar]
- Wang, H.; Liu, L.; Zhang, W.; Zhang, J.; Gan, Z.; Wang, Y.; Wang, C.; Wang, H. Iterative Few-shot Semantic Segmentation from Image Label Text. arXiv 2023, arXiv:2303.05646. [Google Scholar]
- Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from clip. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–24 October 2022; pp. 696–712. [Google Scholar]
- Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. In Proceedings of the CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7076–7086. [Google Scholar]
- Han, M.; Zheng, H.; Wang, C.; Luo, Y.; Hu, H.; Zhang, J.; Wen, Y. PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning. arXiv 2023, arXiv:2308.12757. [Google Scholar]
- Shuai, C.; Fanman, M.; Runtong, Z.; Heqian, Q.; Hongliang, L.; Qingbo, W.; Linfeng, X. Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond. arXiv 2023, arXiv:2308.07539. [Google Scholar]
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Lu, Z.; He, S.; Zhu, X.; Zhang, L.; Song, Y.Z.; Xiang, T. Simpler is better: Few-shot semantic segmentation with classifier weight transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8741–8750. [Google Scholar]
- Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. Mining latent classes for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8721–8730. [Google Scholar]
- Wu, Z.; Shi, X.; Lin, G.; Cai, J. Learning meta-class memory for few-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 517–526. [Google Scholar]
- Lang, C.; Cheng, G.; Tu, B.; Han, J. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8057–8067. [Google Scholar]
- Peng, B.; Tian, Z.; Wu, X.; Wang, C.; Liu, S.; Su, J.; Jia, J. Hierarchical Dense Correlation Distillation for Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23641–23651. [Google Scholar]
- Yang, B.; Liu, C.; Li, B.; Jiao, J.; Ye, Q. Prototype mixture models for few-shot semantic segmentation. In Proceedings of the 16th European Conference of the Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Proceedings—Part VIII 16. Springer: Cham, Switzerland, 2020; pp. 763–778. [Google Scholar]
- Zhang, B.; Xiao, J.; Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8312–8321. [Google Scholar]
Method | mIoU | Training Time | Inference Time |
---|---|---|---|
PFENet [53] | 60.8 | 24 h | 52 ms |
CWT [63] | 56.3 | 10 h | 232 ms |
MMNet [65] | 61.8 | 64 h | 128 ms |
HSNet [55] | 64.0 | 54 h | 101 ms |
BAM [66] | 64.6 | 21 h | 50 ms |
HDMNet [67] | 69.4 | 20 h | 56 ms |
Ours | 62.9 | 5 h | 60 ms |
Method | Backbone | 1-shot | 5-shot | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
fold0 | fold1 | fold2 | fold3 | Mean | fold0 | fold1 | fold2 | fold3 | Mean | ||
FWB [7] | 16.9 | 17.9 | 20.9 | 28.8 | 21.1 | 19.1 | 21.4 | 23.9 | 30.0 | 23.6 | |
PANet [8] | 31.5 | 22.6 | 21.5 | 16.2 | 23.0 | 45.9 | 29.2 | 30.6 | 29.6 | 33.8 | |
PPNet [9] | 36.5 | 26.5 | 26.0 | 19.7 | 27.2 | 48.9 | 31.4 | 36.0 | 30.6 | 36.7 | |
CWT [63] | Res-50 | 32.2 | 36.0 | 31.6 | 31.6 | 32.9 | 40.1 | 43.8 | 39.0 | 42.4 | 41.3 |
MLC [64] | 46.8 | 35.3 | 26.2 | 27.1 | 33.9 | 54.1 | 41.2 | 34.1 | 33.1 | 40.6 | |
SSP [51] | 46.4 | 35.2 | 27.3 | 25.4 | 33.6 | 53.8 | 41.5 | 36.0 | 33.7 | 41.3 | |
Ours | 48.3 | 36.5 | 28.9 | 26.5 | 35.0 | 56.5 | 43.9 | 38.0 | 35.6 | 43.5 | |
PFENet [53] | 34.3 | 33.0 | 32.3 | 30.1 | 32.4 | 38.5 | 38.6 | 38.2 | 34.3 | 37.4 | |
PMMs [68] | 29.5 | 36.8 | 28.9 | 27.0 | 30.6 | 33.8 | 42.0 | 33.0 | 33.3 | 35.5 | |
SCL [69] | 36.4 | 38.6 | 37.5 | 35.4 | 37.0 | 38.9 | 40.5 | 41.5 | 38.7 | 39.9 | |
CWT [63] | Res-101 | 30.3 | 36.6 | 30.5 | 32.2 | 32.4 | 38.5 | 46.7 | 39.4 | 43.2 | 42.0 |
MLC [64] | 50.2 | 37.8 | 27.1 | 30.4 | 36.4 | 57.0 | 46.2 | 37.3 | 37.2 | 44.4 | |
SSP [51] | 50.4 | 39.9 | 30.6 | 30.0 | 37.7 | 57.8 | 47.0 | 40.2 | 39.9 | 46.2 | |
Ours | 52.3 | 40.7 | 33.7 | 31.7 | 39.6 | 61.5 | 48.4 | 42.7 | 43.4 | 49.0 |
MSP | AFBM | MML | fold0 | fold1 | fold2 | fold3 | Mean |
---|---|---|---|---|---|---|---|
60.2 | 69.1 | 70.0 | 53.0 | 63.1 | |||
✓ | 62.5 | 70.2 | 71.8 | 54.3 | 64.7 ↑1.6 | ||
✓ | 65.7 | 71.3 | 72.0 | 56.5 | 66.4 ↑3.3 | ||
✓ | ✓ | 68.4 | 72.4 | 73.6 | 60.2 | 68.7 ↑5.6 | |
✓ | ✓ | ✓ | 69.2 | 73.0 | 75.1 | 61.4 | 69.7 ↑6.6 |