Strong and Weak Supervision Combined with CLIP for Water Surface Garbage Detection
Abstract
1. Introduction
2. Related Work
2.1. Water Surface Garbage Detection
2.2. Cross-Modal Correlation Algorithm
3. Methodology
3.1. Overall Framework
- The 400 million image–text pairs that CLIP collected from the internet provide general knowledge, but they lack specialization for the specific domain of water surface garbage recognition.
- Compared with the image classification task for which the original CLIP was designed, water surface garbage detection is an object detection task and therefore involves more background interference.
- Directly applying image–text contrastive learning to cropped garbage objects may place several instances of the same class in a single batch. Traditional contrastive loss functions treat these instances as equally weighted negatives, which is not reasonable (a loss sketch follows this list).
- Fine-tuning with strongly supervised annotations for every sample is labor-intensive.
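To make the third point concrete, a class-aware contrastive loss in the style of supervised contrastive learning can treat all same-class image–text pairs in a batch as positives rather than negatives. The following PyTorch sketch is illustrative only, not the loss used in this work; the function name, the temperature value, and the batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def class_aware_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    # image_emb: (N, D) embeddings of cropped garbage instances.
    # text_emb:  (N, D) embeddings of the instances' class-text prompts.
    # labels:    (N,) integer class labels of the cropped instances.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarities

    # Entry (i, j) counts as a positive whenever instances i and j share a
    # class, so same-class instances in one batch are no longer negatives.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over all positives of each instance.
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1)
    return loss.mean()
```

The overall framework then proceeds in three stages: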
- Data preparation stage: collect a dataset of water surface garbage images. For strong supervision, annotate the garbage positions in each image at the pixel level; for weak supervision, annotate each image with an image-level label indicating the presence of garbage.
- Strong supervision training stage: train traditional object detection algorithms (e.g., Faster R-CNN, YOLO) on the strongly annotated data to learn the location information of water surface garbage.
- Weak supervision training stage: feed the images into CLIP’s visual encoder to obtain visual feature representations, and feed the image-level textual descriptions into CLIP’s text encoder to obtain textual feature representations. Fuse the visual and textual features into comprehensive feature representations, and combine them with the traditional object detection algorithms for water surface garbage detection and inference (a feature-extraction sketch follows this list).
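Below is a minimal sketch of the weak-supervision feature-extraction step using the public openai/CLIP package. The checkpoint choice, the prompt texts, the file name, and the fusion rule (a weighted sum with weight `alpha`) are illustrative assumptions; the paper's actual fusion design may differ.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("surface_frame.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of floating garbage on water",
                       "a photo of a clean water surface"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 512) visual features
    text_features = model.encode_text(texts)    # (2, 512) textual features

# Normalize both modalities, then blend the visual feature with the most
# similar text feature; alpha is an assumed fusion hyperparameter.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
best_text = text_features[(image_features @ text_features.t()).argmax(dim=-1)]
alpha = 0.5
fused = alpha * image_features + (1 - alpha) * best_text  # comprehensive feature
```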
3.2. Strong and Weak Supervision Combined with CLIP
3.3. Loss Function of Object Detection
3.4. Model Evaluation Indicators
4. Experiments
4.1. Dataset Introduction
4.2. Experiment Setting
4.3. Performance Evaluation
5. Conclusions
- (1) All experiments on the two datasets, covering three object detection algorithms and eight backbones, indicate that among algorithms sharing the same backbone, Faster R-CNN consistently performs the worst and Cascade R-CNN consistently performs the best. Among the backbones used with each algorithm, ViT-B/14 consistently achieves the highest performance and R-50 the lowest, which matches the expected pattern.
- (2) All experiments conducted on the two datasets using three different object detection algorithms with eight different backbones indicate that the proposed approach of combining strong supervision and weak supervision with CLIP significantly improves the mAP values compared to the baseline object detection models.
- (3) Among the CLIP-based algorithms evaluated on the public dataset, the largest improvement is observed for Faster R-CNN with R-50x4 as the backbone, a 3.25 mAP increase. Among those evaluated on the private dataset, the largest improvement is observed for Faster R-CNN with ViT-B/14 as the backbone, a 1.83 mAP increase.
- (4) Among all the CLIP-based algorithms evaluated on the two datasets, the Cascade R-CNN model with ViT-B/14 as the backbone consistently achieves the best performance, reaching 57.46 mAP on the public dataset and 79.69 mAP on the private dataset.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
| Class | Number | Class | Number | Class | Number |
|---|---|---|---|---|---|
| Can | 497 | Leaf | 838 | Branch | 1350 |
| Grass | 329 | Milk box | 359 | Plastic bag | 566 |
| Bottle | 996 | Plastic box | 204 | Glass bottle | 92 |
| Rotten fruit | 175 | Ball | 23 | | |
| Class | Number | Class | Number | Class | Number |
|---|---|---|---|---|---|
| Can | 249 | Leaf | 352 | Branch | 761 |
| Grass | 152 | Milk box | 184 | Plastic bag | 260 |
| Bottle | 490 | Plastic box | 105 | Glass bottle | 41 |
| Rotten fruit | 73 | Ball | 14 | | |
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate adjustment strategy | Step |
| Warmup | Linear |
| Initial learning rate | 0.0025 |
| Batch size | 2 |
| Training epochs | 500 |
| Weight decay | 0.0001 |
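For reference, the schedule in the table above can be sketched in plain PyTorch as a linear warmup followed by step decay. The warmup length, decay interval, and decay factor below are assumed values, since the table does not specify them.

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the actual detector
optimizer = torch.optim.Adam(model.parameters(), lr=0.0025, weight_decay=0.0001)

# Linear warmup followed by step decay; the warmup length, step size, and
# decay factor are assumptions not given in the table.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001, total_iters=500)
decay = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[500])
```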
Faster R-CNN results on the public dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Faster R-CNN | R-50 | 47.14 | 47.89 | +0.75 |
| Faster R-CNN | R-101 | 48.53 | 49.27 | +0.74 |
| Faster R-CNN | R-50x4 | 49.26 | 52.51 | +3.25 |
| Faster R-CNN | R-50x16 | 51.12 | 53.36 | +2.24 |
| Faster R-CNN | R-50x64 | 52.28 | 53.52 | +1.24 |
| Faster R-CNN | ViT-B/32 | 55.29 | 56.23 | +0.94 |
| Faster R-CNN | ViT-B/16 | 55.62 | 55.74 | +0.12 |
| Faster R-CNN | ViT-B/14 | 55.67 | 56.32 | +0.65 |
Mask R-CNN results on the public dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Mask R-CNN | R-50 | 49.54 | 50.21 | +0.67 |
| Mask R-CNN | R-101 | 50.21 | 51.14 | +0.93 |
| Mask R-CNN | R-50x4 | 51.46 | 52.21 | +0.75 |
| Mask R-CNN | R-50x16 | 51.72 | 52.96 | +1.24 |
| Mask R-CNN | R-50x64 | 53.26 | 55.63 | +2.37 |
| Mask R-CNN | ViT-B/32 | 54.27 | 55.23 | +0.96 |
| Mask R-CNN | ViT-B/16 | 54.79 | 56.87 | +2.08 |
| Mask R-CNN | ViT-B/14 | 55.25 | 56.42 | +1.17 |
Cascade R-CNN results on the public dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Cascade R-CNN | R-50 | 49.67 | 50.59 | +0.92 |
| Cascade R-CNN | R-101 | 49.86 | 50.27 | +0.41 |
| Cascade R-CNN | R-50x4 | 49.26 | 50.21 | +0.95 |
| Cascade R-CNN | R-50x16 | 50.53 | 51.34 | +0.81 |
| Cascade R-CNN | R-50x64 | 51.39 | 52.86 | +1.47 |
| Cascade R-CNN | ViT-B/32 | 52.29 | 53.23 | +0.94 |
| Cascade R-CNN | ViT-B/16 | 54.62 | 55.37 | +0.75 |
| Cascade R-CNN (ours) | ViT-B/14 | 55.87 | 57.46 | +1.59 |
Faster R-CNN results on the private dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Faster R-CNN | R-50 | 72.14 | 73.51 | +1.37 |
| Faster R-CNN | R-101 | 73.23 | 74.22 | +0.99 |
| Faster R-CNN | R-50x4 | 74.46 | 74.73 | +0.27 |
| Faster R-CNN | R-50x16 | 74.52 | 75.36 | +0.84 |
| Faster R-CNN | R-50x64 | 74.78 | 75.85 | +1.07 |
| Faster R-CNN | ViT-B/32 | 75.19 | 76.03 | +0.84 |
| Faster R-CNN | ViT-B/16 | 75.62 | 77.17 | +1.55 |
| Faster R-CNN | ViT-B/14 | 75.89 | 77.72 | +1.83 |
Mask R-CNN results on the private dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Mask R-CNN | R-50 | 73.54 | 73.91 | +0.37 |
| Mask R-CNN | R-101 | 73.71 | 74.23 | +0.52 |
| Mask R-CNN | R-50x4 | 73.96 | 74.57 | +0.61 |
| Mask R-CNN | R-50x16 | 74.72 | 75.34 | +0.62 |
| Mask R-CNN | R-50x64 | 75.27 | 76.89 | +1.62 |
| Mask R-CNN | ViT-B/32 | 75.55 | 76.92 | +1.37 |
| Mask R-CNN | ViT-B/16 | 76.24 | 77.25 | +0.91 |
| Mask R-CNN | ViT-B/14 | 76.48 | 77.87 | +1.39 |
Cascade R-CNN results on the private dataset:

| Algorithm | Backbone | mAP | mAP with CLIP | Improvement |
|---|---|---|---|---|
| Cascade R-CNN | R-50 | 73.96 | 74.21 | +0.25 |
| Cascade R-CNN | R-101 | 74.13 | 75.12 | +0.99 |
| Cascade R-CNN | R-50x4 | 74.47 | 75.51 | +1.04 |
| Cascade R-CNN | R-50x16 | 75.42 | 76.36 | +0.94 |
| Cascade R-CNN | R-50x64 | 75.88 | 76.64 | +0.76 |
| Cascade R-CNN | ViT-B/32 | 76.99 | 77.53 | +0.54 |
| Cascade R-CNN | ViT-B/16 | 77.92 | 78.47 | +0.55 |
| Cascade R-CNN (ours) | ViT-B/14 | 78.31 | 79.69 | +1.38 |