Cascaded Cross-Layer Fusion Network for Pedestrian Detection
Abstract
1. Introduction
- (1) We propose a novel Cascaded Cross-layer Fusion (CCF) module that uses high-level semantic information to suppress noise in shallow features while, at the same time, reusing that semantic information to strengthen the final feature map (a minimal fusion sketch is given after this list);
- (2) The center map gives a confidence score for each object center, but that confidence is computed only from local information. We therefore propose a global smooth map (GSMap) that supplies the center map with global information, thereby improving its accuracy;
- (3) The effectiveness of CCFNet is verified on the Caltech and CityPersons datasets.
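To make the first contribution concrete, here is a minimal PyTorch sketch of how a cascaded cross-layer fusion could be wired: deep semantics gate the shallow feature to suppress its noise, then are merged back into the fused map, stage by stage. The gating form, channel widths, module names (`CrossLayerFusion`, `CascadedFusion`), and the assumption of ResNet-50 stages C2–C5 are illustrative guesses, not the paper's exact CCF design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLayerFusion(nn.Module):
    """Illustrative fusion step: high-level semantics gate a shallow feature map
    before the two are merged. The gating form and channel counts are assumptions."""

    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(shallow_ch, out_ch, kernel_size=1)
        self.reduce_deep = nn.Conv2d(deep_ch, out_ch, kernel_size=1)
        # 1x1 conv + sigmoid turns the deep semantics into a gate for the shallow map.
        self.gate = nn.Sequential(nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.Sigmoid())
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        s = self.reduce_shallow(shallow)
        d = self.reduce_deep(deep)
        d = F.interpolate(d, size=s.shape[-2:], mode="bilinear", align_corners=False)
        s = s * self.gate(d)          # suppress shallow noise with deep semantics
        return self.smooth(s + d)     # reuse the deep semantics in the fused map


class CascadedFusion(nn.Module):
    """Cascade the fusion from the deepest stage down to the shallowest."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.top = nn.Conv2d(in_channels[-1], out_ch, kernel_size=1)
        self.stages = nn.ModuleList(
            CrossLayerFusion(c, out_ch, out_ch) for c in reversed(in_channels[:-1])
        )

    def forward(self, feats):
        # feats are ordered shallow -> deep, e.g. ResNet-50 stages C2..C5.
        fused = self.top(feats[-1])
        for shallow, stage in zip(reversed(feats[:-1]), self.stages):
            fused = stage(shallow, fused)
        return fused


if __name__ == "__main__":
    c2 = torch.randn(1, 256, 80, 160)
    c3 = torch.randn(1, 512, 40, 80)
    c4 = torch.randn(1, 1024, 20, 40)
    c5 = torch.randn(1, 2048, 10, 20)
    out = CascadedFusion()((c2, c3, c4, c5))
    print(out.shape)  # torch.Size([1, 256, 80, 160])
```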
2. Related Work
2.1. Anchor-Based and Anchor-Free
2.2. FPN-like Methods
2.3. FCN-like Methods
3. Methods
3.1. Detection Network
3.2. Cascaded Cross-Layer Fusion Module
3.3. Detection Head
3.4. Loss Function
3.4.1. Center Loss
3.4.2. Scale Loss
3.4.3. Total Loss
4. Experimental Results
4.1. Datasets
4.2. Experimental Setting
4.3. Ablation Study
4.4. State-of-the-Art Comparisons
4.5. Visualization
4.6. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 21–37.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Li, Z.; Tang, J.; Zhang, L.; Yang, J. Weakly-supervised Semantic Guided Hashing for Social Image Retrieval. Int. J. Comput. Vis. 2020, 128, 2265–2278.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
- Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
- Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224.
- Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2Det: A single-shot object detector based on multi-level feature pyramid network. AAAI Conf. Artif. Intell. 2019, 33, 9259–9266.
- Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature pyramid transformer. In Proceedings of the European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2020; pp. 323–339.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Huang, L.; Yang, Y.; Deng, Y.; Yu, Y. DenseBox: Unifying landmark localization with end to end object detection. arXiv 2015, arXiv:1509.04874.
- Zhu, Z.; Li, Z. Online Video Object Detection via Local and Mid-Range Feature Propagation. In Proceedings of the 1st International Workshop on Human-Centric Multimedia Analysis, Seattle, WA, USA, 10–14 October 2020; pp. 73–82.
- Li, Z.; Tang, J.; Mei, T. Deep collaborative embedding for social image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2070–2083.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Zhou, H.; Li, Z.; Ning, C.; Tang, J. CAD: Scale invariant framework for real-time object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 760–768.
- Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 734–750.
- Li, Z.; Sun, Y.; Tang, J. CTNet: Context-based Tandem Network for Semantic Segmentation. arXiv 2021, arXiv:2104.09805.
- Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5187–5196.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Munich, Germany, 2015; pp. 234–241.
- Sun, Y.; Li, Z. SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost. arXiv 2021, arXiv:2111.03392.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. CornerNet-Lite: Efficient keypoint based object detection. arXiv 2019, arXiv:1904.08900.
- Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36.
- Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188.
- Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Song, T.; Sun, L.; Xie, D.; Sun, H.; Pu, S. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 536–551.
- Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7774–7783.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9657–9666.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9627–9636.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Dollár, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 304–311.
- Dollár, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 743–761.
- Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
- Hasan, I.; Liao, S.; Li, J.; Akram, S.U.; Shao, L. Generalizable Pedestrian Detection: The Elephant in the Room. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11328–11337.
- Wang, W. Adapted Center and Scale Prediction: More Stable and More Accurate. arXiv 2020, arXiv:2002.09053.
- Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv 2014, arXiv:1404.5997.
- Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604.
- Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided context feature pyramid network for object detection. arXiv 2020, arXiv:2005.11475.
- Liu, W.; Liao, S.; Hu, W.; Liang, X.; Chen, X. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 618–634.
- Pang, Y.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-guided attention network for occluded pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4967–4975.
- Mao, J.; Xiao, T.; Jiang, Y.; Cao, Z. What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3127–3136.
- Zhang, J.; Lin, L.; Zhu, J.; Li, Y.; Chen, Y.C.; Hu, Y.; Hoi, C.S. Attribute-aware pedestrian detection in a crowd. IEEE Trans. Multimed. 2020, 23, 3085–3097.
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 637–653.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Zhang, D.; Zhang, H.; Tang, J.; Hua, X.S.; Sun, Q. Self-Regulation for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, 11–17 October 2021; pp. 6953–6963.
- Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529.
Subsets | Height (pixels) | Visibility
---|---|---
Reasonable | [50, ∞] | [0.65, 1]
Bare | [50, ∞] | [0.90, 1]
Partial | [50, ∞] | [0.65, 0.90]
Heavy | [50, ∞] | [0, 0.65]
Reasonable_Occ=Heavy | [50, ∞] | [0.2, 0.65]
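As a rough illustration, the subset definitions above can be written as a small configuration and used to filter annotations. The annotation field names (`height`, `visibility`) and the interpretation of the open upper bounds as +∞ and 1.0 are assumptions made for this sketch.

```python
import math

# Subset definitions from the table above; bounds are (low, high) inclusive.
SUBSETS = {
    "Reasonable":           {"height": (50, math.inf), "visibility": (0.65, 1.0)},
    "Bare":                 {"height": (50, math.inf), "visibility": (0.90, 1.0)},
    "Partial":              {"height": (50, math.inf), "visibility": (0.65, 0.90)},
    "Heavy":                {"height": (50, math.inf), "visibility": (0.00, 0.65)},
    "Reasonable_Occ=Heavy": {"height": (50, math.inf), "visibility": (0.20, 0.65)},
}


def in_subset(ann, subset):
    """Return True if an annotation dict falls inside the given evaluation subset."""
    h_lo, h_hi = SUBSETS[subset]["height"]
    v_lo, v_hi = SUBSETS[subset]["visibility"]
    return h_lo <= ann["height"] <= h_hi and v_lo <= ann["visibility"] <= v_hi


annotations = [{"height": 80, "visibility": 0.7}, {"height": 40, "visibility": 0.95}]
reasonable = [a for a in annotations if in_subset(a, "Reasonable")]
print(len(reasonable))  # 1 (the second box is too small for the Reasonable subset)
```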
Feature Maps |  |  |  | Backbone | Reasonable | Bare | Partial | Heavy
---|---|---|---|---|---|---|---|---
√ | √ | - | - | ResNet-50 | 29.4 | 22.8 | 26.9 | 67.0
- | √ | √ | - | ResNet-50 | 16.6 | 12.3 | 15.4 | 55.2
- | - | √ | √ | ResNet-50 | 15.5 | 10.3 | 15.4 | 56.3
√ | √ | √ | - | ResNet-50 | 16.3 | 12.4 | 15.3 | 54.4
- | √ | √ | √ | ResNet-50 | 15.4 | 10.8 | 14.6 | 53.7
√ | √ | √ | √ | ResNet-50 | 10.6 | 7.1 | 10.1 | 48.4
Method | Reasonable | Bare | Partial | Heavy
---|---|---|---|---
ResNet-50 + FPN | 11.9 | 8.1 | 11.6 | 48.6
ResNet-50 + AugFPN | 11.9 | 8.5 | 11.7 | 50.2
ResNet-50 + ACFPN | 11.8 | 8.2 | 11.2 | 50.7
ResNet-50 + CSP | 11.2 | 7.7 | 10.6 | 45.7
ResNet-50 + CCF | 10.6 | 7.1 | 10.1 | 48.4
Method | Backbone | Reasonable | Bare | Partial | Heavy
---|---|---|---|---|---
Baseline | ResNet-50 | 11.2 | 7.3 | 10.8 | 50.3
Baseline + GSMap | ResNet-50 | 10.5 | 7.0 | 10.0 | 46.6
Baseline + CCF | ResNet-50 | 10.6 | 7.1 | 10.1 | 48.4
Baseline + CCF + GSMap | ResNet-50 | 10.2 | 6.8 | 9.5 | 42.7
Scale Prediction | Backbone | Reasonable | Bare | Partial | Heavy
---|---|---|---|---|---
Height | ResNet-50 | 10.8 | 7.2 | 10.7 | 47.2
Width | ResNet-50 | 11.4 | 8.1 | 11.0 | 49.9
Height + Width | ResNet-50 | 10.2 | 6.8 | 9.5 | 42.7
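For context on this ablation, the sketch below shows one way a center heatmap plus predicted height and/or width maps could be decoded into boxes. The output stride, the log-scale encoding, and the fixed width/height ratio used when only the height is predicted are assumptions for illustration, not the paper's exact detection head.

```python
import numpy as np

STRIDE = 4            # assumed downsampling factor of the prediction maps
ASPECT_RATIO = 0.41   # assumed fixed width/height ratio for the height-only variant


def decode(center_map, h_map, w_map=None, score_thresh=0.5):
    """Turn a center heatmap and (log-)scale maps into [x1, y1, x2, y2, score] boxes.

    center_map, h_map, w_map are 2D arrays of identical shape.
    When w_map is None, the width is derived from the height with a fixed ratio.
    """
    ys, xs = np.where(center_map >= score_thresh)
    boxes = []
    for y, x in zip(ys, xs):
        h = np.exp(h_map[y, x]) * STRIDE
        w = np.exp(w_map[y, x]) * STRIDE if w_map is not None else h * ASPECT_RATIO
        cx, cy = (x + 0.5) * STRIDE, (y + 0.5) * STRIDE
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, center_map[y, x]])
    return np.array(boxes)


center = np.zeros((96, 192))
center[40, 100] = 0.9                          # one confident center point
h_pred = np.full_like(center, np.log(30.0))    # ~120 px tall after the stride
print(decode(center, h_pred))
```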
Method | Reasonable | Reasonable_Occ=Heavy
---|---|---
ALFNet [52] | 6.1 | 51.0
MGAN [53] | 6.8 | 38.2
HyperLearner [54] | 5.5 | 48.7
RepLoss [38] | 5.0 | 47.9
CSP [23] | 4.5 | 45.8
CCFNet (ours) | 4.3 | 43.2
ALFNet + city [23,52] | 4.5 | 43.4
RepLoss + city [23,38] | 4.0 | 41.8
CSP + city [23] | 3.8 | 36.5
CCFNet + city (ours) | 3.5 | 36.2
Method | Backbone | Reasonable | Bare | Partial | Heavy
---|---|---|---|---|---
TLL [37] | ResNet-50 | 15.5 | 10.0 | 17.2 | 53.6
TLL + MRF [37] | ResNet-50 | 14.4 | 9.2 | 15.9 | 52.0
RepLoss [38] | ResNet-50 | 13.2 | 7.6 | 16.8 | 56.9
OR-CNN [56] | VGG-16 | 12.8 | 6.7 | 15.3 | 55.7
ALFNet [52] | ResNet-50 | 12.0 | 8.4 | 11.4 | 51.9
CSP [23] | ResNet-50 | 11.0 | 7.3 | 10.4 | 49.3
APD [55] | ResNet-50 | 10.6 | 7.1 | 9.5 | 49.8
MGAN [53] | VGG-16 | 10.5 | - | - | 47.2
CCFNet (ours) | ResNet-50 | 10.2 | 6.8 | 9.5 | 42.7
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).