CC-DETR: DETR with Hybrid Context and Multi-Scale Coordinate Convolution for Crowd Counting
Abstract
1. Introduction
- We propose a novel method, CC-DETR, which extends the recent DETR-based object detection framework to the crowd counting task.
- To handle complex scene semantics, we propose the HCDETR module, which uses a DETR-like encoder–decoder structure for hybrid context understanding. By feeding features from different levels of the backbone into the encoder and decoder, it learns global context and fuses information across scales simultaneously.
- We propose a regression head built on the Coordinate Dilated Convolution Module (CDCM), which combines coordinate convolution with parallel dilated convolutions of different dilation rates to model location-sensitive, multi-scale information.
- We conduct extensive experiments on multiple benchmark datasets, showing that CC-DETR outperforms several state-of-the-art crowd counting methods.
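The CDCM contribution above combines two standard ingredients: CoordConv-style coordinate channels and parallel dilated convolutions. The following is a minimal pure-Python sketch of those ingredients, not the paper's implementation; the function names (`coord_channels`, `dilated_conv2d`, `cdcm_sketch`), the 3×3 kernel size, and the summation of parallel branches are illustrative assumptions.

```python
def coord_channels(h, w):
    """Normalized y/x coordinate maps in [-1, 1] (CoordConv, Liu et al. 2018)."""
    ys = [[-1.0 + 2.0 * i / (h - 1)] * w for i in range(h)]
    xs = [[-1.0 + 2.0 * j / (w - 1) for j in range(w)] for _ in range(h)]
    return ys, xs

def dilated_conv2d(x, kernel, dilation):
    """Naive single-channel 3x3 convolution with zero padding and a dilation
    factor; larger dilations enlarge the receptive field without extra weights."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for u in range(3):
                for v in range(3):
                    ii = i + (u - 1) * dilation
                    jj = j + (v - 1) * dilation
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[u][v] * x[ii][jj]
    return out

def cdcm_sketch(feat, kernels, dilations):
    """Hypothetical CDCM forward pass: append coordinate channels to `feat`
    (a list of HxW maps), run every channel through parallel dilated
    branches with different rates, and sum all branch outputs."""
    h, w = len(feat[0]), len(feat[0][0])
    ys, xs = coord_channels(h, w)
    channels = list(feat) + [ys, xs]
    total = [[0.0] * w for _ in range(h)]
    for kernel, d in zip(kernels, dilations):
        for ch in channels:
            o = dilated_conv2d(ch, kernel, d)
            for i in range(h):
                for j in range(w):
                    total[i][j] += o[i][j]
    return total
```

The coordinate channels make the otherwise translation-equivariant convolutions aware of absolute position, while the different dilation rates capture heads at different scales, which is the intuition the bullet describes.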
2. Related Works
2.1. Detection-Based Methods
2.2. Density-Based Methods
3. Our Method
3.1. Backbone
3.2. Hybrid Context DETR
3.2.1. Transformer Encoder
3.2.2. Transformer Decoder
3.3. Coordinate Dilated Convolution Module
3.4. Loss Function Design
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
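The tables below report MAE and MSE, which in the crowd counting literature are conventionally computed over per-image counts; note that the quantity papers call "MSE" is actually a root mean squared error. A short sketch, assuming the standard definitions (the paper's exact formulation is not reproduced here):

```python
import math

def mae(preds, gts):
    """Mean Absolute Error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(preds, gts)) / len(preds)

def mse(preds, gts):
    """Root mean squared error; crowd counting papers report this as MSE."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(preds))
```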
4.4. Comparison with State-of-the-Art Methods
4.5. Visualizations
4.6. Ablation Studies
4.6.1. Effect of N
4.6.2. Effect of HCDETR
4.6.3. Effect of CDCM
4.6.4. Combined Effects of HCDETR and CDCM
4.6.5. Effect of Hyperparameters
4.7. Complexity and Efficiency Analysis
4.8. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Deng, L.; Zhou, Q.; Wang, S.; Górriz, J.M.; Zhang, Y. Deep learning in crowd counting: A survey. CAAI Trans. Intell. Technol. 2023, early view, 1–35. [Google Scholar]
- Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Savner, S.S.; Kanhangad, V. CrowdFormer: Weakly-supervised crowd counting with improved generalizability. J. Vis. Commun. Image Represent. 2023, 94, 103853. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Lin, H.; Ma, Z.; Ji, R.; Wang, Y.; Hong, X. Boosting crowd counting via multifaceted attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19628–19637. [Google Scholar]
- Yang, S.; Guo, W.; Ren, Y. CrowdFormer: An overlap patching vision transformer for top-down crowd counting. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Liu, R.; Lehman, J.; Molino, P.; Petroski Such, F.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
- Liu, Y.; Shi, M.; Zhao, Q.; Wang, X. Point in, box out: Beyond counting persons in crowds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6469–6478. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Pham, V.Q.; Kozakaya, T.; Yamaguchi, O.; Okada, R. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 3–17 December 2015; pp. 3253–3261. [Google Scholar]
- Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
- Babu Sam, D.; Surya, S.; Venkatesh Babu, R. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5744–5752. [Google Scholar]
- Zeng, L.; Xu, X.; Cai, B.; Qiu, S.; Zhang, T. Multi-scale convolutional neural networks for crowd counting. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 465–469. [Google Scholar]
- Shu, W.; Wan, J.; Tan, K.C.; Kwong, S.; Chan, A.B. Crowd counting in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19618–19627. [Google Scholar]
- Liu, W.; Salzmann, M.; Fua, P. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5099–5108. [Google Scholar]
- Yan, Z.; Yuan, Y.; Zuo, W.; Tan, X.; Wang, Y.; Wen, S.; Ding, E. Perspective-guided convolution networks for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 952–961. [Google Scholar]
- Sun, G.; Liu, Y.; Probst, T.; Paudel, D.P.; Popovic, N.; Van Gool, L. Boosting crowd counting with transformers. arXiv 2021, arXiv:2105.10926. [Google Scholar]
- Du, Z.; Shi, M.; Deng, J.; Zafeiriou, S. Redesigning multi-scale neural network for crowd counting. IEEE Trans. Image Process. 2023, 32, 3664–3678. [Google Scholar] [CrossRef] [PubMed]
- Tian, Y.; Chu, X.; Wang, H. Cctrans: Simplifying and improving crowd counting with transformer. arXiv 2021, arXiv:2109.14483. [Google Scholar]
- Liang, D.; Xu, W.; Bai, X. An end-to-end transformer model for crowd localization. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 38–54. [Google Scholar]
- Liang, D.; Chen, X.; Xu, W.; Zhou, Y.; Bai, X. Transcrowd: Weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 2022, 65, 160104. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
- Fang, Y.; Li, Y.; Tu, X.; Tan, T.; Wang, X. Face completion with hybrid dilated convolution. Signal Process. Image Commun. 2020, 80, 115664. [Google Scholar] [CrossRef]
- Wang, B.; Liu, H.; Samaras, D.; Nguyen, M.H. Distribution matching for crowd counting. Adv. Neural Inf. Process. Syst. 2020, 33, 1595–1607. [Google Scholar]
- Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3365–3374. [Google Scholar]
- Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
- Wang, Q.; Gao, J.; Lin, W.; Li, X. NWPU-crowd: A large-scale benchmark for crowd counting and localization. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2141–2149. [Google Scholar] [CrossRef] [PubMed]
- Xu, C.; Qiu, K.; Fu, J.; Bai, S.; Xu, Y.; Bai, X. Learn to scale: Generating multipolar normalized density maps for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8382–8390. [Google Scholar]
- Lei, Y.; Liu, Y.; Zhang, P.; Liu, L. Towards using count-level weak supervision for crowd counting. Pattern Recognit. 2021, 109, 107616. [Google Scholar] [CrossRef]
- Hu, Y.; Jiang, X.; Liu, X.; Zhang, B.; Han, J.; Cao, X.; Doermann, D. Nas-count: Counting-by-density with neural architecture search. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 747–766. [Google Scholar]
- Wan, J.; Wang, Q.; Chan, A.B. Kernel-based density map generation for dense object counting. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1357–1370. [Google Scholar] [CrossRef] [PubMed]
- Liu, L.; Qiu, Z.; Li, G.; Liu, S.; Ouyang, W.; Lin, L. Crowd counting with deep structured scale integration network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1774–1783. [Google Scholar]
- Liu, L.; Lu, H.; Zou, H.; Xiong, H.; Cao, Z.; Shen, C. Weighing counts: Sequential crowd counting by reinforcement learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 164–181. [Google Scholar]
- Liu, X.; Yang, J.; Ding, W.; Wang, T.; Wang, Z.; Xiong, J. Adaptive mixture regression network with local counting map for crowd counting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 241–257. [Google Scholar]
- Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8198–8207. [Google Scholar]
| Method | Position | Part A MAE | Part A MSE | Part B MAE | Part B MSE | QNRF MAE | QNRF MSE |
|---|---|---|---|---|---|---|---|
| L2SM [33] | × | 64.2 | 98.4 | 7.2 | 11.1 | 104.7 | 173.6 |
| TransCrowd [25] | × | 66.1 | 105.1 | 9.3 | 16.1 | 97.2 | 168.5 |
| MATT [34] | × | 80.1 | 129.4 | 11.7 | 17.5 | - | - |
| AMSNet [35] | × | 56.7 | 93.4 | 6.7 | 10.2 | 101.8 | 163.2 |
| KDMG [36] | × | 63.8 | 99.2 | 7.8 | 12.7 | 99.5 | 173.0 |
| DSSI-Net [37] | ✓ | 60.6 | 96.0 | 6.8 | 10.3 | 99.1 | 159.2 |
| LibraNet [38] | ✓ | 55.9 | 97.1 | 7.3 | 11.3 | 88.1 | 143.7 |
| AMRNet [39] | ✓ | 61.5 | 98.3 | 7.0 | 11.0 | 86.6 | 152.2 |
| DM-Count [29] | ✓ | 59.7 | 95.7 | 7.4 | 11.8 | 85.6 | 148.3 |
| BCCT [21] | ✓ | 53.1 | 82.2 | 7.3 | 11.3 | 83.3 | 143.4 |
| P2PNet [30] | ✓ | 52.7 | 85.1 | 6.3 | 9.9 | 85.3 | 154.5 |
| CCTrans [23] | ✓ | 52.3 | 84.9 | 6.2 | 9.9 | 82.8 | 142.3 |
| CC-DETR (ours) | ✓ | 51.8 | 83.3 | 6.1 | 9.7 | 82.2 | 144.6 |
| Method | Position | Val. MAE | Val. MSE | Test MAE | Test MSE | Test NAE |
|---|---|---|---|---|---|---|
| MCNN [2] | × | 218.5 | 218.5 | 232.5 | 714.6 | - |
| CSRNet [14] | × | 104.8 | 433.4 | 121.3 | 387.8 | - |
| SFCN [40] | × | 95.4 | 608.3 | 105.4 | 424.1 | - |
| TransCrowd [25] | × | 88.4 | 400.5 | 117.7 | 451.0 | 0.244 |
| BL | ✓ | 93.6 | 470.4 | 105.4 | 454.2 | 0.203 |
| DM-Count [29] | ✓ | 70.5 | 357.6 | 88.4 | 388.6 | 0.169 |
| BCCT [21] | ✓ | 53.0 | 170.3 | 82.0 | 366.9 | 0.164 |
| P2PNet [30] | ✓ | - | - | 77.4 | 362.0 | - |
| CC-DETR (ours) | ✓ | 41.80 | 110.37 | 75.76 | 344.17 | 0.150 |
Hybrid Context | MAE | MSE |
---|---|---|
× | 54.8 | 92.7 |
✓ | 51.8 | 83.3 |
Method | MAE | MSE |
---|---|---|
Coord | 54.4 | 91.5 |
Dilated Conv | 53.6 | 86.7 |
Coord + Dilated Conv | 51.8 | 83.3 |
HCDETR | CDCM | MAE | MSE |
---|---|---|---|
× | ✓ | 53.3 | 86.9 |
✓ | × | 53.8 | 88.6 |
✓ | ✓ | 51.8 | 83.3 |
| Method | Backbone | Parameters | Epochs | Part A MAE | Part A MSE | Part B MAE | Part B MSE |
|---|---|---|---|---|---|---|---|
| CCTrans [23] | Twins-large | 104 M | 1500 | 52.3 | 84.9 | 6.2 | 9.9 |
| CC-DETR (ours) | Twins-large | 154 M | 500 | 51.8 | 83.3 | 6.1 | 9.7 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Gu, Y.; Zhang, T.; Hu, Y.; Nian, F. CC-DETR: DETR with Hybrid Context and Multi-Scale Coordinate Convolution for Crowd Counting. Mathematics 2024, 12, 1562. https://doi.org/10.3390/math12101562