Mitigating Distractor Challenges in Video Object Segmentation through Shape and Motion Cues
Abstract
1. Introduction
- We design a lightweight object shape extraction module that exploits both high-level and low-level features to obtain object shape information, which is then used to refine the predicted masks.
- We introduce a novel object-level motion prediction module that stores representative motion features during training and predicts object motion by retrieving them during inference, which makes our method both efficient and robust.
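The store-then-retrieve idea in the second contribution can be illustrated with a generic soft-addressed memory read. This is only a minimal sketch under stated assumptions: the function names (`softmax`, `dot`, `read_memory`), the dot-product similarity, the softmax weighting, and the toy 2-D slots are all hypothetical choices made for illustration, and the authors' actual module may differ in every one of these respects.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_memory(query, slots, temperature=1.0):
    """Soft-address a bank of stored feature vectors (hypothetical helper).

    query: feature vector describing the current observation.
    slots: list of stored feature vectors (the "memory bank").
    Returns (weights, retrieved), where weights sum to 1 and retrieved
    is the weighted sum of the slots.
    """
    scores = [dot(query, slot) / temperature for slot in slots]
    weights = softmax(scores)
    retrieved = [
        sum(w * slot[i] for w, slot in zip(weights, slots))
        for i in range(len(query))
    ]
    return weights, retrieved


# Toy bank with three 2-D slots; in a trained model these would be learned.
slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, retrieved = read_memory([2.0, 0.0], slots)
print("weights:", [round(w, 3) for w in weights])
print("retrieved:", [round(v, 3) for v in retrieved])
```

The slot most similar to the query receives the largest weight, so the retrieved vector is dominated by it. Because the bank is fixed after training in this sketch, reading it at inference costs only one similarity pass over the slots.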
2. Related Works
2.1. Semi-Supervised VOS
2.2. Matching-Based VOS Networks
2.3. VOS Network with Spatio-Temporal Constraint
3. Methods
3.1. Network Overview
3.1.1. Encoders
3.1.2. Memory Reading
3.1.3. Decoder
3.2. Object Shape Extraction Module
3.2.1. Utilizing Backbone Network Information
3.2.2. Gated Convolutional Layer
3.2.3. Shape Spatio-Temporal Constraints
3.3. Object Motion Prediction Module
3.3.1. Motion Signature Memory Bank
3.3.2. Motion Prediction Using Memory
3.3.3. Motion Fusion
4. Results
4.1. Datasets and Metrics
4.1.1. DAVIS
4.1.2. YouTube-VOS
4.1.3. Metrics
4.2. Implementation Details
4.3. Comparison with Other Methods
4.3.1. Quantitative Comparison
4.3.2. Qualitative Comparison
4.4. Ablation Study
5. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Oh, S.W.; Lee, J.Y.; Xu, N.; Kim, S.J. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9226–9235.
2. Cheng, H.K.; Tai, Y.W.; Tang, C.K. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 11781–11794.
3. Yang, Z.; Wei, Y.; Yang, Y. Collaborative video object segmentation by foreground-background integration. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 332–348.
4. Yang, Z.; Wei, Y.; Yang, Y. Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4704–4712.
5. Li, M.; Hu, L.; Xiong, Z.; Zhang, B.; Pan, P.; Liu, D. Recurrent Dynamic Embedding for Video Object Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1332–1341.
6. Li, Y.; Shen, Z.; Shan, Y. Fast video object segmentation using the global context module. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 735–750.
7. Hu, L.; Zhang, P.; Zhang, B.; Pan, P.; Xu, Y.; Jin, R. Learning position and target consistency for memory-based video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4144–4154.
8. Xie, H.; Yao, H.; Zhou, S.; Zhang, S.; Sun, W. Efficient regional memory network for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1286–1295.
9. Seong, H.; Hyun, J.; Kim, E. Kernelized memory network for video object segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 629–645.
10. Chen, Y.; Zhang, D.; Yang, Z.X.; Wu, E. Robust and Efficient Memory Network for Video Object Segmentation. arXiv 2023, arXiv:2304.11840.
11. Chen, Y.; Zhang, D.; Zheng, Y.; Yang, Z.X.; Wu, E.; Zhao, H. Boosting Video Object Segmentation via Robust and Efficient Memory Network. IEEE Trans. Circuits Syst. Video Technol. 2023.
12. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2.
13. Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 davis challenge on video object segmentation. arXiv 2017, arXiv:1704.00675.
14. Xu, N.; Yang, L.; Fan, Y.; Yue, D.; Liang, Y.; Yang, J.; Huang, T. Youtube-vos: A large-scale video object segmentation benchmark. arXiv 2018, arXiv:1809.03327.
15. Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. Mots: Multi-object tracking and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7942–7951.
16. Chen, X.; Li, Z.; Yuan, Y.; Yu, G.; Shen, J.; Qi, D. State-aware tracker for real-time video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9384–9393.
17. Cheng, H.K.; Schwing, A.G. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022.
18. Seong, H.; Oh, S.W.; Lee, J.Y.; Lee, S.; Lee, S.; Kim, E. Hierarchical memory matching network for video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12889–12898.
19. Yang, Z.; Wei, Y.; Yang, Y. Associating objects with transformers for video object segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 2491–2502.
20. Liu, Q.; Wu, J.; Jiang, Y.; Bai, X.; Yuille, A.L.; Bai, S. InstMove: Instance Motion for Object-centric Video Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6344–6354.
21. Chen, Y.; Hao, C.; Yang, Z.X.; Wu, E. Fast target-aware learning for few-shot video object segmentation. Sci. China Inf. Sci. 2022, 65, 182104.
22. Ye, Q.; Huang, P.; Zhang, Z.; Zheng, Y.; Fu, L.; Yang, W. Multiview learning with robust double-sided twin SVM. IEEE Trans. Cybern. 2021, 52, 12745–12758.
23. Fu, L.; Li, Z.; Ye, Q.; Yin, H.; Liu, Q.; Chen, X.; Fan, X.; Yang, W.; Yang, G. Learning Robust Discriminant Subspace Based on Joint L2,p- and L2,s-Norm Distance Metrics. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 130–144.
24. Miles, R.; Yucel, M.K.; Manganelli, B.; Saà-Garriga, A. MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 10480–10490.
25. Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-scnn: Gated shape cnns for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5229–5238.
26. Zhang, P.; Hu, L.; Zhang, B.; Pan, P.; Alibaba, D. Spatial consistent memory network for semi-supervised video object segmentation. In Proceedings of the CVPR Workshops, Seattle, WA, USA, 14–19 June 2020; Volume 6, p. 2.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Oh, S.W.; Lee, J.Y.; Sunkavalli, K.; Kim, S.J. Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7376–7385.
29. Rempe, D.; Birdal, T.; Hertzmann, A.; Yang, J.; Sridhar, S.; Guibas, L.J. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11488–11499.
30. Lee, S.; Kim, H.G.; Choi, D.H.; Kim, H.I.; Ro, Y.M. Video prediction recalling long-term motion context via memory alignment learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3054–3063.
31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
32. Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 724–732.
33. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
35. Perazzi, F.; Khoreva, A.; Benenson, R.; Schiele, B.; Sorkine-Hornung, A. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2663–2672.
36. Cho, S.; Lee, H.; Lee, M.; Park, C.; Jang, S.; Kim, M.; Lee, S. Tackling background distraction in video object segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 446–462.
37. Cheng, H.K.; Tai, Y.W.; Tang, C.K. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5559–5568.
38. Yang, L.; Wang, Y.; Xiong, X.; Yang, J.; Katsaggelos, A.K. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6499–6507.
39. Ventura, C.; Bellver, M.; Girbau, A.; Salvador, A.; Marques, F.; Giro-i Nieto, X. Rvos: End-to-end recurrent network for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5277–5286.
40. Luiten, J.; Voigtlaender, P.; Leibe, B. Premvos: Proposal-generation, refinement and merging for video object segmentation. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 565–580.
41. Zhang, Y.; Wu, Z.; Peng, H.; Lin, S. A transductive approach for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6949–6958.
42. Wang, H.; Jiang, X.; Ren, H.; Hu, Y.; Bai, S. Swiftnet: Real-time video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1296–1305.
43. Lu, X.; Wang, W.; Danelljan, M.; Zhou, T.; Shen, J.; Gool, L.V. Video object segmentation with episodic graph memory networks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 661–679.
44. Bhat, G.; Lawin, F.J.; Danelljan, M.; Robinson, A.; Felsberg, M.; Gool, L.V.; Timofte, R. Learning what to learn for video object segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 777–794.
45. Chen, Y.; Ji, C.; Yang, Z.X.; Wu, E. Spatial constraint for efficient semi-supervised video object segmentation. Comput. Vis. Image Underst. 2023, 237, 103843.
46. Duke, B.; Ahmed, A.; Wolf, C.; Aarabi, P.; Taylor, G.W. Sstvos: Sparse spatiotemporal transformers for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5912–5921.
47. Voigtlaender, P.; Leibe, B. Online adaptation of convolutional neural networks for video object segmentation. arXiv 2017, arXiv:1706.09364.
48. Voigtlaender, P.; Chai, Y.; Schroff, F.; Adam, H.; Leibe, B.; Chen, L.C. Feelvos: Fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9481–9490.
Method | OL | M | J&F | J | F
---|---|---|---|---|---|
OSMN [38] | ✓ | 54.8 | 52.5 | 57.1 | |
RGMP [28] | 66.7 | 64.8 | 68.6 | ||
RVOS [39] | 50.3 | 48.0 | 52.6 | ||
Track-Seg [16] | 72.3 | 68.6 | 76.0 | ||
PReMVOS [40] | ✓ | 77.8 | 73.9 | 81.7 | |
TVOS [41] | 72.3 | 69.9 | 74.7 | ||
GC [6] | ✓ | 71.4 | 69.3 | 73.5 | |
SwiftNet [42] | ✓ | 81.1 | 78.3 | 83.9 | |
STM [1] | ✓ | 81.8 | 79.2 | 84.3 | |
GraphMem [43] | ✓ | 82.8 | 80.2 | 85.2 | |
MiVOS [37] | ✓ | 83.3 | 80.6 | 85.9 | |
CFBI [3] | 81.9 | 79.1 | 84.6 | ||
KMN [9] | ✓ | 82.8 | 80.0 | 85.6 | |
RMNet [8] | ✓ | 83.5 | 81.0 | 86.0 | |
LWL [44] | ✓ | 81.6 | 79.1 | 84.1 | |
CFBI+ [4] | 82.9 | 80.1 | 85.7 | ||
LCM [7] | ✓ | 83.5 | 80.5 | 86.5 | |
STCN [2] | ✓ | 85.4 | 82.2 | 88.6 | |
RDE [5] | ✓ | 84.2 | 80.8 | 87.5 | |
SCE [45] | ✓ | 84.2 | 80.8 | 87.5 | |
Miles [24] | ✓ | 83.7 | 80.2 | 87.1 | |
InstMove [20] | ✓ | 85.1 | 82.3 | 87.9 | |
Ours | ✓ | 85.2 | 82.3 | 88.0 |
Method | M | G | J (seen) | F (seen) | J (unseen) | F (unseen) |
---|---|---|---|---|---|---|
STM [1] | ✓ | 79.2 | 79.6 | 83.6 | 73.0 | 80.6 |
MiVOS [37] | ✓ | 82.4 | 80.6 | 84.7 | 78.2 | 85.9 |
CFBI [3] | | 81.0 | 80.6 | 85.1 | 75.2 | 83.0 |
SST [46] | | 81.8 | 80.9 | - | 76.6 | - |
STCN [2] | ✓ | 82.7 | 81.1 | 85.4 | 78.2 | 85.9 |
RDE [5] | ✓ | 81.9 | 81.1 | 85.5 | 76.2 | 84.8 |
SCE [45] | ✓ | 81.6 | 81.3 | 85.7 | 75.0 | 83.4 |
Miles [24] | ✓ | 82.3 | 81.6 | 86.0 | 76.3 | 85.2 |
InstMove [20] | ✓ | 83.4 | 82.5 | 86.9 | 77.9 | 86.0 |
Ours | ✓ | 83.5 | 82.4 | 87.0 | 78.3 | 86.3 |
Method | OL | M | J&F | J | F
---|---|---|---|---|---|
OSMN [38] | ✓ | 41.3 | 37.3 | 44.9 | |
OnAVOS [47] | ✓ | 56.5 | 53.4 | 59.6 | |
FEELVOS [48] | 57.8 | 55.2 | 60.5 | ||
PReMVOS [40] | ✓ | 71.6 | 67.5 | 75.7 | |
STM [1] | ✓ | 72.2 | 69.3 | 75.2 | |
RMNet [8] | ✓ | 75.0 | 71. | 78.1 | |
CFBI [3] | 76.6 | 73.0 | 80.1 | ||
KMN [9] | ✓ | 77.2 | 74.1 | 80.3 | |
CFBI+ [4] | 78.0 | 74.4 | 81.6 | ||
LCM [7] | ✓ | 78.1 | 74.4 | 81.8 | |
STCN [2] | ✓ | 76.1 | 72.6 | 79.6 | |
Ours | ✓ | 77.5 | 74.0 | 81.0 |
Method | OL | M | J&F | J | F
---|---|---|---|---|---|
OSMN [38] | ✓ | 73.5 | 74.0 | 72.9 | |
MaskTrack [35] | 77.6 | 79.7 | 75.4 | ||
FEELVOS [48] | 81.7 | 81.1 | 82.2 | ||
RGMP [28] | 81.8 | 81.5 | 82.0 | ||
Track-Seg [16] | 83.1 | 82.6 | 83.6 | ||
OnAVOS [47] | ✓ | 85.5 | 86.1 | 84.9 | |
PReMVOS [40] | ✓ | 86.8 | 84.9 | 88.6 | |
GC [6] | ✓ | 86.8 | 87.6 | 85.7 | |
RMNet [8] | ✓ | 88.8 | 88.9 | 88.7 | |
STM [1] | ✓ | 89.3 | 88.7 | 89.9 | |
CFBI [3] | 89.4 | 88.3 | 90.5 | ||
CFBI+ [4] | 89.9 | 88.7 | 91.1 | ||
MiVOS [37] | ✓ | 90.0 | 88.9 | 91.9 | |
SwiftNet [42] | ✓ | 90.4 | 90.5 | 90.3 | |
KMN [9] | ✓ | 90.5 | 89.5 | 91.5 | |
LCM [7] | ✓ | 90.7 | 91.4 | 89.9 | |
STCN [2] | ✓ | 91.6 | 90.8 | 92.5 | |
RDE [5] | ✓ | 91.1 | 89.7 | 92.5 | |
SCE [45] | ✓ | 90.8 | 91.7 | 89.9 | |
Miles [24] | ✓ | 90.6 | 89.7 | 91.6 | |
Ours | ✓ | 91.6 | 90.6 | 92.7 |
Method | J&F | J | F |
---|---|---|---|
Without OSEM and OMPM | 82.5 | 79.3 | 85.7 |
With OSEM | 83.7 | 80.3 | 87.0 |
With OMPM | 83.9 | 81.3 | 86.5 |
With OSEM and OMPM | 84.2 | 81.3 | 87.1 |
Method | J&F | J | F |
---|---|---|---|
With shape spatio-temporal constraints | 83.4 | 80.3 | 86.4 |
With shape-based refinement module | 83.5 | 80.1 | 86.8 |
With both | 83.7 | 80.3 | 87.0 |
Method | J&F | J | F |
---|---|---|---|
Pretraining only | 76.9 | 74.4 | 79.5 |
Main training only | 84.2 | 81.3 | 87.1 |
Both | 85.2 | 82.3 | 88.0 |
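In the ablation rows above, the first value of each row matches the mean of the second and third up to rounding, which is consistent with the usual DAVIS convention in which J&F is the average of region similarity J and contour accuracy F. The short check below is a sketch of that arithmetic only. The variable names and the 0.1 tolerance are assumptions chosen for illustration, and the numbers are copied from the tables as printed.

```python
# Rows copied from the ablation tables: (label, first value, J, F).
rows = [
    ("Without OSEM and OMPM", 82.5, 79.3, 85.7),
    ("With OSEM", 83.7, 80.3, 87.0),
    ("With OMPM", 83.9, 81.3, 86.5),
    ("With OSEM and OMPM", 84.2, 81.3, 87.1),
    ("Pretraining only", 76.9, 74.4, 79.5),
    ("Both", 85.2, 82.3, 88.0),
]


def mean_jf(j, f):
    """Average of region similarity J and contour accuracy F."""
    return (j + f) / 2


gaps = []
for label, reported, j, f in rows:
    gap = abs(mean_jf(j, f) - reported)
    gaps.append(gap)
    print(f"{label}: reported {reported}, mean(J, F) = {mean_jf(j, f):.2f}")

# Each reported value should sit within rounding distance of the mean.
consistent = all(gap <= 0.1 for gap in gaps)
print("consistent with J&F = (J + F) / 2:", consistent)
```

Small gaps of about 0.05 are expected, because the published J and F values are themselves rounded to one decimal place.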
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Peng, J.; Zhao, Y.; Zhang, D.; Chen, Y. Mitigating Distractor Challenges in Video Object Segmentation through Shape and Motion Cues. Appl. Sci. 2024, 14, 2002. https://doi.org/10.3390/app14052002