A Lightweight Face Detector via Bi-Stream Convolutional Neural Network and Vision Transformer
Abstract
1. Introduction
1. We introduce a novel backbone architecture for efficient face detection that leverages the advantages of both CNNs and transformers, outputting multiscale features through a hybrid backbone to detect faces across scale variations.
2. The proposed FEC block employs a spatial dimension reconstruction operation and standard convolutional stacks to preserve detailed facial textures while facilitating feature fusion between the transformer and CNN blocks.
3. By combining standard convolution layers with a branched channel attention architecture, the proposed MFA module enhances the detector's ability to differentiate facial features from background elements along both the spatial and channel dimensions.
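The paper gives no code for the bi-stream backbone, but the basic idea of fusing a CNN stream with a transformer stream can be sketched as a channel-wise concatenation followed by a 1×1 convolution. The sketch below is a minimal pure-Python illustration under that assumption (nested-list feature maps, a hand-supplied mixing matrix); it is not the authors' implementation.

```python
def fuse_streams(cnn_feats, vit_feats, weights):
    # cnn_feats, vit_feats: lists of H x W feature maps from the two streams.
    # weights: Cout x (Ccnn + Cvit) mixing matrix, i.e. a 1x1 convolution
    # applied after channel-wise concatenation.
    feats = cnn_feats + vit_feats  # concatenate along the channel axis
    H, W = len(feats[0]), len(feats[0][0])
    return [[[sum(w[c] * feats[c][i][j] for c in range(len(feats)))
              for j in range(W)] for i in range(H)]
            for w in weights]
```

With one channel per stream and unit weights, the fused map is simply the element-wise sum of the two streams; learned weights would instead reweight each stream per output channel.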
2. Related Works
2.1. CNN-Based Face Detectors
2.2. Transformer-Based Vision Tasks
3. Proposed Method
3.1. Method Overview
3.2. Hybrid Backbone
3.3. Feature Enhancement Convolution Block
3.4. Multiscale Feature Aggregation Module
4. Experimental Results
4.1. Datasets and Evaluation Metrics
4.2. Implementation Details
4.3. Component Evaluation
4.3.1. Ablation Study on the Hybrid Backbone
4.3.2. Ablation Study on the FEC
4.3.3. Ablation Study on the MFA
4.4. Comparison with the SOTA Methods
4.5. Running Efficiency
5. Limitation and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Notations
References
- Zhang, S.; Zhu, R.; Wang, X.; Shi, H.; Fu, T.; Wang, S.; Mei, T.; Li, S. Improved selective refinement network for face detection. arXiv 2019, arXiv:1901.06651.
- Kuzdeuov, A.; Koishigarina, D.; Varol, H.A. Anyface: A data-centric approach for input-agnostic face detection. In Proceedings of the 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Republic of Korea, 13–16 February 2023; pp. 211–218.
- Howard, A.G.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
- Wang, H.; Li, Z.; Ji, X.; Wang, Y. Face r-cnn. arXiv 2017, arXiv:1706.01061.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask r-cnn. arXiv 2017, arXiv:1703.06870.
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
- Liu, Y.; Wang, F.; Sun, B.; Li, H. Mogface: Towards a deeper appreciation on face detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4083–4092.
- Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5202–5211.
- Zhang, F.; Fan, X.; Ai, G.; Song, J.; Qin, Y.; Wu, J. Accurate face detection for high performance. arXiv 2019, arXiv:1905.01585.
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017.
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768.
- Wang, L.; Koniusz, P. 3mformer: Multi-order multi-mode transformer for skeletal action recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5620–5631.
- Li, Y.; Yu, Z.; Choy, C.B.; Xiao, C.; Álvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9087–9098.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Shen, C. Conditional positional encodings for vision transformers. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
- Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. Coatnet: Marrying convolution and attention for all data sizes. arXiv 2021, arXiv:2106.04803.
- Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
- Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the integration of self-attention and convolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 805–815.
- Zhang, H.; Hu, W.; Wang, X. Parc-net: Position aware circular convolution with merits from convnets and transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022.
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 7–10 December 2021.
- Liu, J.; Li, H.; Kong, W. Multi-level learning counting via pyramid vision transformer and cnn. Eng. Appl. Artif. Intell. 2023, 123, 106184.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. Cbam: Convolutional block attention module. arXiv 2018, arXiv:1807.06521.
- Yang, S.; Xiong, Y.; Loy, C.C.; Tang, X. Face detection through scale-friendly deep convolutional networks. arXiv 2017, arXiv:1706.02863.
- Zhang, C.; Xu, X.; Tu, D. Face detection using improved faster rcnn. arXiv 2018, arXiv:1802.02142.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
- Li, J.; Wang, Y.; Wang, C.; Tai, Y.; Qian, J.; Yang, J.; Wang, C.; Li, J.; Huang, F. Dsfd: Dual shot face detector. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5055–5064.
- He, Y.; Xu, D.; Wu, L.; Jian, M.; Xiang, S.; Pan, C. Lffd: A light and fast face detector for edge devices. arXiv 2019, arXiv:1904.10633.
- Qi, D.; Tan, W.; Yao, Q.; Liu, J. Yolo5face: Why reinventing a face detector. In Computer Vision–ECCV 2022 Workshops; Springer: Cham, Switzerland, 2022.
- Wang, G.Q.; Li, J.Y.; Wu, Z.; Xu, J.; Shen, J.; Yang, W. Efficientface: An efficient deep network with feature enhancement for accurate face detection. arXiv 2023, arXiv:2302.11816.
- Yoo, Y.J.; Han, D.; Yun, S. Extd: Extremely tiny face detector via iterative filter reuse. arXiv 2019, arXiv:1906.06579.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872.
- Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680.
- Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597.
- Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.-L.; Lin, H.; Sun, Y.; He, T.; Mueller, J.W.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 2735–2745.
- Yang, S.; Luo, P.; Loy, C.C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533.
- Jain, V.; Learned-Miller, E.G. Fddb: A Benchmark for Face Detection in Unconstrained Settings; UMass Amherst: Amherst, MA, USA, 2010.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
- Xu, D.; Wu, L.; He, Y.; Zhao, Q.; Jian, M.; Yan, J.; Zhao, L. Os-lffd: A light and fast face detector with ommateum structure. Multimed. Tools Appl. 2020, 80, 34153–34172.
- Guo, J.; Deng, J.; Lattas, A.; Zafeiriou, S. Sample and computation redistribution for efficient face detection. arXiv 2021, arXiv:2105.04714.
- Jiang, C.; Ma, H.; Li, L. Irnet: An improved retinanet model for face detection. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 129–134.
- Zhu, Y.; Cai, H.; Zhang, S.; Wang, C.; Xiong, Y. Tinaface: Strong but simple baseline for face detection. arXiv 2020, arXiv:2011.13183.
Backbones | Easy | Medium | Hard | Params (M) |
---|---|---|---|---|
MobileNet V3 [3] | 93.75% | 91.48% | 81.29% | 3.67 |
EfficientNet-B0 [41] | 94.27% | 92.59% | 83.82% | 4.77 |
Ours | 95.30% | 94.20% | 87.56% | 3.80 |
Model | Easy | Medium | Hard |
---|---|---|---|
Baseline | 93.78% | 92.70% | 86.87% |
Baseline + FEC (w Conv) | 94.13% | 93.09% | 87.03% |
Baseline + FEC (w DP) | 95.30% | 94.20% | 87.56% |
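The FEC block's "spatial dimension reconstruction" is not spelled out in this excerpt; one common realisation of such an operation is a space-to-depth rearrangement, which trades spatial resolution for channels without discarding any texture detail. The pure-Python sketch below illustrates that lossless rearrangement; the function name and nested-list tensor layout are assumptions for illustration, not the authors' code.

```python
def space_to_depth(x, r):
    # x: C x H x W tensor as nested lists; moves each r x r spatial block
    # into r*r new channels, so H and W shrink by a factor of r while the
    # channel count grows by r*r. No values are dropped or averaged.
    C, H, W = len(x), len(x[0]), len(x[0][0])
    assert H % r == 0 and W % r == 0, "spatial dims must be divisible by r"
    out = []
    for dy in range(r):
        for dx in range(r):
            for c in range(C):
                out.append([[x[c][i * r + dy][j * r + dx]
                             for j in range(W // r)] for i in range(H // r)])
    return out
```

Unlike strided pooling, this keeps every input value, which is why such rearrangements are attractive when fine facial textures must survive downsampling.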
Dataset | w/o MFA | w MFA |
---|---|---|
Easy | 93.70% | 95.30% |
Medium | 92.69% | 94.20% |
Hard | 84.82% | 87.56% |
Model | Easy | Medium | Hard |
---|---|---|---|
E-CT Face + Avgpool | 94.77% | 93.87% | 87.28% |
E-CT Face + Maxpool | 95.14% | 93.96% | 86.65% |
E-CT Face + Avgpool & Maxpool | 95.30% | 94.20% | 87.56% |
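The ablation above indicates that the MFA's channel attention combines average- and max-pooled descriptors, a design popularised by CBAM [24]. The pure-Python sketch below illustrates that dual-pooling channel attention; the shared-MLP weights, tensor layout, and function name are illustrative assumptions, not the authors' implementation.

```python
import math

def channel_attention(x, w1, w2):
    # x: list of C feature maps (H x W). Each channel is reduced by both
    # average pooling and max pooling; both descriptors pass through a
    # shared two-layer MLP (weights w1: hidden x C, w2: C x hidden),
    # are summed, and squashed by a sigmoid into per-channel scales.
    C = len(x)
    avg = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in x]
    mx = [max(max(row) for row in fm) for fm in x]

    def mlp(v):
        hidden = [max(0.0, sum(w1[h][c] * v[c] for c in range(C)))
                  for h in range(len(w1))]
        return [sum(w2[c][h] * hidden[h] for h in range(len(hidden)))
                for c in range(len(w2))]

    a, m = mlp(avg), mlp(mx)
    scale = [1.0 / (1.0 + math.exp(-(a[c] + m[c]))) for c in range(C)]
    # Reweight each channel of the input by its attention scale.
    return [[[scale[c] * v for v in row] for row in x[c]] for c in range(C)]
```

The table is consistent with the intuition behind this design: average pooling summarises overall channel response while max pooling keeps the strongest local activation, and using both outperforms either alone.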
Light Detector | Easy | Medium | Hard | Params (M) |
---|---|---|---|---|
YoloV5 Face-n [31] | 93.6% | 91.5% | 80.5% | 1.72 |
YoloV5 Face-s [31] | 94.3% | 92.6% | 83.1% | 7.06 |
EXTD [33] | 92.1% | 91.1% | 85.6% | 0.16 |
LFFD [30] | 91.0% | 88.1% | 78.0% | 2.15 |
OS-LFFD [42] | 91.6% | 88.4% | 77.1% | 1.44 |
Efficient Face-B0 [32] | 91.0% | 89.1% | 83.6% | 3.94 |
Efficient Face-B1 [32] | 91.9% | 90.2% | 85.1% | 6.64 |
Efficient Face-B2 [32] | 92.5% | 91.0% | 86.3% | 7.98 |
SCRFD-10GF [43] | 95.1% | 93.8% | 83.0% | 3.86 |
IRNet [44] | 91.8% | 89.3% | 76.6% | 1.68 |
Ours | 95.30% | 94.20% | 87.56% | 3.80 |
Heavy Detector | Easy | Medium | Hard | Params (M) |
---|---|---|---|---|
AInnoFace [9] | 97.0% | 96.1% | 91.8% | 88.01 |
MogFaceAli-AMS [7] | 94.6% | 93.6% | 87.3% | 36.07 |
MogFace [7] | 97.0% | 94.3% | 93.0% | 85.26 |
TinaFace [45] | 95.6% | 94.2% | 81.4% | 172.95 |
YoloV5 Face-X6 [31] | 96.67% | 95.08% | 86.55% | 88.665 |
Ours | 95.30% | 94.20% | 87.56% | 3.80 |
Platforms | 320 × 320 | 640 × 640 | 960 × 960 |
---|---|---|---|
NVIDIA GeForce 2080Ti | 9 ms (111 FPS) | 13 ms (76 FPS) | 28 ms (35 FPS) |
CPU | 35 ms (28 FPS) | 114 ms (8 FPS) | — |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, Z.; Chao, Q.; Wang, S.; Yu, T. A Lightweight Face Detector via Bi-Stream Convolutional Neural Network and Vision Transformer. Information 2024, 15, 290. https://doi.org/10.3390/info15050290