Local Attention Sequence Model for Video Object Detection
Abstract
1. Introduction
2. Related Work
2.1. Image Object Detection
2.2. Video Object Detection
2.3. Self-Attention
3. Method
3.1. Overview
3.2. Spatial Attention
3.3. Local Attention Sequence Model
3.4. Modified ConvGRU
4. Experiments
4.1. Dataset and Setup
4.2. Results
4.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
| Method | mAP | P | R | F1 | airplane | antelope | bear | bike | bird | bus |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLO | 33.77 | 48.03 | 28.55 | 35.81 | 36.62 | 11.46 | 22.62 | 36.93 | 31.97 | 56.55 |
| LA | 35.57 | 55.23 | 29.43 | 38.40 | 70.28 | 25.34 | 15.22 | 42.06 | 41.35 | 47.95 |

| Method | car | cattle | dog | cat | elephant | fox | panda | hamster | horse | lion | lizard | monkey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLO | 20.34 | 46.66 | 8.26 | 48.47 | 29.98 | 47.34 | 44.85 | 79.73 | 55.04 | 3.99 | 5.19 | 0.65 |
| LA | 21.92 | 40.02 | 3.68 | 35.99 | 26.53 | 50.33 | 50.01 | 83.36 | 44.45 | 1.95 | 14.93 | 0.78 |

| Method | moto | rabbit | red panda | sheep | snake | squirrel | tiger | train | turtle | boat | whale | zebra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLO | 55.48 | 10.57 | 10.47 | 70.11 | 20.14 | 0.30 | 15.77 | 93.81 | 34.74 | 63.91 | 13.42 | 37.82 |
| LA | 49.16 | 13.03 | 18.74 | 71.27 | 25.31 | 0.54 | 17.56 | 89.73 | 38.01 | 72.37 | 21.15 | 34.11 |
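In these tables, mAP is the arithmetic mean of the 30 per-class AP values, while P, R and F1 summarize overall precision, recall and their harmonic mean. The sketch below illustrates one common way to compute these summary scores; it is a minimal illustration that assumes detections have already been matched to ground truth at a fixed IoU threshold and that P/R/F1 are micro-averaged, not the authors' evaluation code.

```python
# Minimal sketch of the summary metrics; not the authors' evaluation code.
# Assumption: detections have already been matched to ground-truth boxes at a
# fixed IoU threshold, yielding per-class TP/FP/FN counts.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClassCounts:
    tp: int  # correctly detected objects of this class
    fp: int  # spurious detections of this class
    fn: int  # ground-truth objects of this class that were missed


def precision_recall_f1(counts: List[ClassCounts]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all classes."""
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def mean_average_precision(per_class_ap: List[float]) -> float:
    """mAP is the arithmetic mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Usage with hypothetical counts for two classes:
counts = [ClassCounts(tp=80, fp=20, fn=40), ClassCounts(tp=50, fp=30, fn=10)]
print(precision_recall_f1(counts))             # (0.722..., 0.722..., 0.722...)
print(mean_average_precision([36.62, 70.28]))  # 53.45
```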
| | (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|---|
| YOLO | ✓ | ✓ | ✓ | ✓ | ✓ |
| ConvGRU | | ✓ | | | |
| Modified ConvGRU | | | ✓ | | ✓ |
| Spatial & Local Attention | | | | ✓ | ✓ |
| mAP | 33.773 | 35.026 | 34.867 | 34.467 | 35.571 |
| time/ms | 4.731 | 6.701 | 5.636 | 5.019 | 5.882 |
| FLOPs/G | 0.390 | 0.512 | 0.427 | 0.402 | 0.439 |
| Parameters/M | 1.013 | 1.718 | 1.188 | 1.027 | 1.202 |
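Since the ablation separates the contribution of a standard ConvGRU from the Modified ConvGRU and the Spatial & Local Attention modules, a plain convolutional GRU cell is sketched below as a point of reference for the recurrent baseline. It is an illustrative implementation of the standard formulation, not the paper's modified cell; the channel counts, kernel size and feature-map shape in the usage example are assumptions.

```python
# Illustrative standard ConvGRU cell in PyTorch, shown only as a baseline
# reference; the paper's Modified ConvGRU differs from this. Channel counts,
# kernel size and the feature-map shape in the usage example are assumptions.
from typing import Optional

import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Update (z) and reset (r) gates, computed jointly from [x, h].
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, kernel_size, padding=pad)
        # Candidate hidden state, computed from [x, r * h].
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.hid_ch = hid_ch

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor] = None) -> torch.Tensor:
        if h is None:  # initialise the hidden state with zeros on the first frame
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state, same spatial size as x


# Usage: propagate backbone feature maps frame by frame through the cell.
cell = ConvGRUCell(in_ch=256, hid_ch=256)
h = None
for _ in range(4):
    frame_feat = torch.randn(1, 256, 13, 13)  # placeholder per-frame feature map
    h = cell(frame_feat, h)
print(h.shape)  # torch.Size([1, 256, 13, 13])
```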
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Z.; Zhuang, X.; Wang, H.; Nie, Y.; Tang, J. Local Attention Sequence Model for Video Object Detection. Appl. Sci. 2021, 11, 4561. https://doi.org/10.3390/app11104561