UWV-Yolox: A Deep Learning Model for Underwater Video Object Detection
Abstract
1. Introduction
- We use the Contrast Limited Adaptive Histogram Equalization (CLAHE) method to enhance the contrast of underwater videos, making the model better suited to underwater detection. We also propose a new CSP_CA module that integrates Coordinate Attention to strengthen the representations of the objects of interest (illustrative sketches of both follow this list);
- We propose a new loss function, combining classification, regression, confidence, and jitter losses, to improve the model’s performance on video detection (its assumed overall form is sketched below);
- We propose a frame-level optimization module, which uses tubelet linking, re-scoring, and re-coordinating to refine the model’s video object detection results (a simplified post-processing sketch follows this list).
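To make the preprocessing step concrete, the following is a minimal sketch of CLAHE contrast enhancement on video frames using OpenCV; it is illustrative rather than the authors’ exact pipeline, and the clip limit, tile grid size, and input file name are assumptions.

```python
import cv2

def clahe_enhance(frame, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE to the lightness channel of a BGR frame."""
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)  # locally equalized lightness, clipped to limit noise amplification
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

cap = cv2.VideoCapture("underwater_clip.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    enhanced = clahe_enhance(frame)
    # `enhanced` would then be passed to the detector instead of the raw frame
cap.release()
```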
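The CSP_CA module builds on the standard Coordinate Attention block of Hou et al.; a PyTorch sketch of that block is given below. How it is embedded into the CSP structure of the Yolox backbone is not reproduced here, and the reduction ratio is an assumed default.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: factorizes pooling into height and width directions."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (b, c, h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (b, c, 1, w)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (b, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (b, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # height attention
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # width attention
        return x * a_h * a_w
```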
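The composite loss in the second contribution can be summarized, under the assumption that the four terms are combined as a weighted sum (the weights and the exact definition of the jitter term are not reproduced here), as:

```latex
\mathcal{L}_{\text{total}} =
  \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} +
  \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}} +
  \lambda_{\text{conf}}\,\mathcal{L}_{\text{conf}} +
  \lambda_{\text{jit}}\,\mathcal{L}_{\text{jit}}
```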
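For the frame-level optimization module, the sketch below illustrates tubelet-style post-processing in the spirit of Seq-NMS and tubelet-based detectors: it assumes detections have already been linked across frames into a tubelet, re-scores each detection with the tubelet mean confidence, and re-coordinates each box with a short moving average. The linking criterion, window size, and re-scoring rule used by UWV-Yolox itself may differ.

```python
from statistics import mean

def refine_tubelet(tubelet, window=3):
    """Re-score and re-coordinate detections along one linked tubelet.

    `tubelet` is a list of dicts ordered by frame index:
    {"frame": int, "box": [x1, y1, x2, y2], "score": float}.
    """
    avg_score = mean(d["score"] for d in tubelet)  # tubelet-level confidence
    refined = []
    for i, det in enumerate(tubelet):
        # Re-coordinating: smooth each box coordinate over a small temporal window.
        lo, hi = max(0, i - window // 2), min(len(tubelet), i + window // 2 + 1)
        smoothed = [mean(d["box"][k] for d in tubelet[lo:hi]) for k in range(4)]
        # Re-scoring: replace the per-frame score with the tubelet average.
        refined.append({"frame": det["frame"], "box": smoothed, "score": avg_score})
    return refined
```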
2. Related Work
2.1. Image-Level Underwater Object Detection
2.2. Video-Level Object Detection
3. Methods
3.1. Video Contrast Enhancement
3.2. CSP_CA Module
3.3. Loss Function
3.4. Frame-Level Optimization
4. Results
4.1. Experimental Environment Configuration
4.2. Dataset
4.3. Experimental Results
4.3.1. Evaluation Metrics
4.3.2. Experimental Results with Different Data Augmentation Methods
4.3.3. Experimental Results with Individual Improvement
4.3.4. Results of the Ablation Experiments
4.3.5. Experimental Results of Different Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
2. Zuiderveld, K. Contrast Limited Adaptive Histogram Equalization. In Graphic Gems IV; Academic Press Professional: San Diego, CA, USA, 1994; pp. 474–485.
3. Iqbal, K.; Salam, R.A.; Osman, A.; Talib, A.Z. Underwater image enhancement using an integrated colour model. IAENG Int. J. Comput. Sci. 2007, 34, 2.
4. Huang, D.; Wang, Y.; Song, W.; Sequeira, J.; Mavromatis, S. Shallow-water image enhancement using relative global histogram stretching based on adaptive parameter acquisition. In Proceedings of the MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, 5–7 February 2018; pp. 453–465.
5. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4770–4778.
6. Fu, M.; Liu, H.; Yu, Y.; Chen, J.; Wang, K. DW-GAN: A discrete wavelet transform GAN for nonhomogeneous dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 203–212.
7. Liu, X.; Zhang, T.; Zhang, J. Toward visual quality enhancement of dehazing effect with improved Cycle-GAN. Neural Comput. Appl. 2022, 35, 5277–5290.
8. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on YOLO v4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706.
9. Zhang, H.; Wu, J.; Yu, H.; Wang, W.; Zhang, Y.; Zhou, Y. An underwater fish individual recognition method based on improved YoloV4 and FaceNet. In Proceedings of the 2021 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), London, UK, 20–21 December 2021; pp. 196–200.
10. Li, S.; Pan, B.; Cheng, Y.; Yan, X.; Wang, C.; Yang, C. Underwater fish object detection based on attention mechanism improved Ghost-YOLOv5. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; pp. 599–603.
11. Jiao, L.; Zhang, R.; Liu, F.; Yang, S.; Hou, B.; Li, L.; Tang, X. New generation deep learning for video object detection: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3195–3215.
12. Han, W.; Khorrami, P.; Paine, T.L.; Ramachandran, P.; Babaeizadeh, M.; Shi, H.; Li, J.; Yan, S.; Huang, T.S. Seq-NMS for video object detection. arXiv 2016, arXiv:1602.08465.
13. Patraucean, V.; Handa, A.; Cipolla, R. Spatio-temporal video autoencoder with differentiable memory. arXiv 2015, arXiv:1511.06309.
14. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3038–3046.
15. Chai, Y. Patchwork: A patch-wise attention network for efficient object detection and segmentation in video streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3415–3424.
16. Wang, T.; Xiong, J.; Xu, X.; Shi, Y. SCNN: A general distribution based statistical convolutional neural network with application to video object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5321–5328.
17. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
18. Kang, K.; Ouyang, W.; Li, H.; Wang, X. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 817–825.
19. He, L.; Zhou, Q.; Li, X.; Niu, L.; Cheng, G.; Li, X.; Liu, W.; Tong, Y.; Ma, L.; Zhang, L. End-to-end video object detection with spatial-temporal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 1507–1516.
20. Zhao, W.; Zhang, J.; Li, L.; Barnes, N.; Liu, N.; Han, J. Weakly supervised video salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16826–16835.
21. Wen, G.; Li, S.; Liu, F.; Luo, X.; Er, M.J.; Mahmud, M.; Wu, T. YOLOv5s-CA: A modified YOLOv5s network with coordinate attention for underwater target detection. Sensors 2023, 23, 3367.
22. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407.
23. Pedersen, M.; Bruslund Haurum, J.; Gade, R.; Moeslund, T.B. Detection of marine animals in a new underwater dataset with varying visibility. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 18–26.
24. Jiang, L.; Wang, Y.; Jia, Q.; Xu, S.; Liu, Y.; Fan, X.; Li, H.; Liu, R.; Xue, X.; Wang, R. Underwater species detection using channel sharpening attention. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 4259–4267.
25. Liu, H.; Song, P.; Ding, R. Towards domain generalization in underwater object detection. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual Conference, 25–28 October 2020; pp. 1971–1975.
26. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
27. Ancuti, C.; Ancuti, C.O.; Haber, T.; Bekaert, P. Enhancing underwater images and videos by fusion. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 81–88.
28. Wang, Y.; Yan, X.; Guan, D.; Wei, M.; Chen, Y.; Zhang, X.P.; Li, J. Cycle-SNSPGAN: Towards real-world image dehazing via cycle spectral normalized soft likelihood estimation patch GAN. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20368–20382.
29. Zhou, Q.; Li, X.; He, L.; Yang, Y.; Cheng, G.; Tong, Y.; Ma, L.; Tao, D. TransVOD: End-to-end video object detection with spatial-temporal transformers. arXiv 2022, arXiv:2201.05047.
30. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164.
31. Shi, Y.; Wang, N.; Guo, X. YOLOV: Making still image object detectors great at video object detection. arXiv 2022, arXiv:2208.09686.
| Split | Number of Videos | Number of Images | Number of Objects |
|---|---|---|---|
| Train | 59 | 6886 | 15,141 |
| Test | 15 | 1773 | 5930 |
| Total | 74 | 8659 | 21,071 |
| Model | Augmentation Method | mAP@0.5 (%) |
|---|---|---|
| Yolox | None | 85.8 |
| Improved Yolox | Fusion [27] | 81.4 (−4.4) |
| Improved Yolox | Cycle-SNSPGAN [28] | 82.9 (−2.9) |
| Improved Yolox | RGHS | 86.0 (+0.2) |
| UWV-Yolox | CLAHE | 87.0 (+1.2) |
| Method | mAP@0.5 (%) | Parameters |
|---|---|---|
| Baseline | 85.8 | 99.00 M |
| + Regression loss | 86.0 (+0.2) | 99.00 M |
| + Coordinate Attention | 86.5 (+0.7) | 82.66 M |
| + CLAHE | 87.0 (+1.2) | 99.00 M |
| + Frame-level optimization | 87.3 (+1.5) | 99.00 M |
| + Jitter loss | 88.2 (+2.4) | 99.00 M |
| CLAHE | CA Module | Jitter Loss | Regression Loss | Frame-Level Optimization | mAP@0.5 (%) |
|---|---|---|---|---|---|
|  |  |  |  |  | 85.8 |
| ✓ |  |  |  |  | 87.0 (+1.2) |
| ✓ | ✓ |  |  |  | 87.5 (+0.5) |
| ✓ | ✓ | ✓ |  |  | 88.0 (+0.5) |
| ✓ | ✓ | ✓ | ✓ |  | 88.8 (+0.8) |
| ✓ | ✓ | ✓ | ✓ | ✓ | 89.0 (+0.2) |
| Model | Backbone | Input Size | Batch Size | mAP@0.5 (%) | FPS |
|---|---|---|---|---|---|
| TransVOD_Lite [29] | ResNet-101 | 600 × 600 | 4 | 69.0 | 14.9 |
| Boosting R-CNN [30] | ResNet-50 | 1333 × 800 | 4 | 77.4 | 25.4 |
| Yolov5 | CSPDarknet | 640 × 640 | 32 | 82.3 | 156.2 |
| YOLOV [31] | CSPDarknet | 640 × 640 | 32 | 85.0 | 104.1 |
| Yolox | CSPDarknet | 640 × 640 | 16 | 85.8 | 75.6 |
| UWV-Yolox | CA_CSPDarknet | 640 × 640 | 16 | 89.0 | 71.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).