Complex Indoor Human Detection with You Only Look Once: An Improved Network Designed for Human Detection in Complex Indoor Scenes
Abstract
1. Introduction
- A new deep-learning-based method for vision-based human detection, named CIHD-YOLO, has been developed. The method adapts and optimizes the YOLOv8 architecture, significantly improving detection accuracy in complex indoor environments.
- Because no dedicated dataset for indoor human detection existed, the HCIE dataset was created. It combines multiple dimensions, such as different camera angles, varied lighting conditions, indoor obstacles, and people from different age groups, into a comprehensive resource.
- Combining spatial pyramid pooling with effective partial self-attention (SPPEPSA) allows the network to extract features at multiple scales and aggregate them locally, enhancing its ability to capture critical information and improving detection of human subjects at various scales.
- The Generalized Separated and Enhancement Aggregation Network (GSEAM) compensates for the losses caused by occlusion and uneven illumination by combining depth-wise separable convolutions with residual connections, enabling the extraction of effective features from poorly lit indoor scenes.
2. Related Works
3. Methods
3.1. CIHD-YOLO Network
3.2. Spatial Pyramid Pooling with Effective Partial Self-Attention (SPPEPSA)
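This extract describes SPPEPSA only at a high level (multi-scale spatial pyramid pooling followed by a partial self-attention stage that aggregates features locally). The following PyTorch sketch illustrates that idea under stated assumptions rather than reproducing the paper's exact module: the pooling cascade mirrors YOLOv8's SPPF, and attention is applied to only half of the channels; the `PartialSelfAttention` split and layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class PartialSelfAttention(nn.Module):
    """Self-attention over half the channels (the 'partial' idea);
    the other half passes through unchanged, keeping cost low."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.half = channels // 2  # assumed split; half must divide num_heads
        self.attn = nn.MultiheadAttention(self.half, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.split([self.half, x.shape[1] - self.half], dim=1)
        n, c, h, w = a.shape
        seq = a.flatten(2).transpose(1, 2)        # (N, H*W, C/2) token sequence
        seq, _ = self.attn(seq, seq, seq)         # global self-attention
        a = seq.transpose(1, 2).reshape(n, c, h, w)
        return self.proj(torch.cat((a, b), dim=1))

class SPPEPSA(nn.Module):
    """SPPF-style cascaded max pooling followed by partial self-attention."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
        self.psa = PartialSelfAttention(c_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)          # three cascaded poolings emulate
        y2 = self.pool(y1)         # pyramid pooling at growing receptive fields
        y3 = self.pool(y2)
        return self.psa(self.cv2(torch.cat((x, y1, y2, y3), dim=1)))
```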
3.3. Generalized Separated and Enhancement Aggregation Network (GSEAM)
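Likewise, the extract states only that GSEAM combines depth-wise separable convolutions with residual connections to cope with occlusion and poor illumination. Below is a minimal sketch of such a block, assuming a standard depth-wise/point-wise factorization with an identity shortcut; the block name, layer count, and activation choice are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depth-wise convolution followed by a point-wise (1x1) convolution."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pw(self.dw(x))))

class GSEAMBlock(nn.Module):
    """Residual block built from depth-wise separable convolutions, so features
    degraded by occlusion or low light still propagate via the identity path."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = DWSeparableConv(channels, channels)
        self.conv2 = DWSeparableConv(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.conv1(x))  # residual connection
```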
3.4. Global Spatial and Channel Reconstruction Convolution (GSCConv)
4. Experimental Evaluation
4.1. HCIE Dataset
- Diversity: The dataset encompasses diverse lighting conditions, shooting angles, backgrounds, human body sizes, and other variations within indoor scenes, which enhances the model’s generalization capability.
- Class balance: Although the dataset focuses on a single human class, it covers a wide range of factors, including diverse ages, genders, body types, and poses of human subjects in indoor settings (standing, sitting, lying down, etc.), so the model learns features across the full spectrum of human appearances.
- Similar targets: The dataset contains both the human bodies to be detected and objects that partially resemble the human body; during annotation, only the intended human targets are labeled, so the model must learn to distinguish them from look-alike distractors.
- Practicality: The collected data are quality-controlled, with image sizes kept close to those of the actual usage scenarios.
- Data integrity: To ensure consistency and completeness between images and annotations and to avoid missing or mismatched data, LabelImg was used to label the entire dataset; a minimal pairing check is sketched below.
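As a minimal sketch of such an integrity check, the following script pairs images with YOLO-format .txt annotations and validates each label line. The directory layout, the .jpg extension, and the single class id 0 are assumptions for illustration (LabelImg can also export Pascal VOC XML).

```python
from pathlib import Path

# Hypothetical layout: HCIE/images/*.jpg with YOLO-format labels in HCIE/labels/*.txt
IMG_DIR, LBL_DIR = Path("HCIE/images"), Path("HCIE/labels")

def check_pairing():
    """Find images without labels and labels without images."""
    images = {p.stem for p in IMG_DIR.glob("*.jpg")}
    labels = {p.stem for p in LBL_DIR.glob("*.txt")}
    return images - labels, labels - images   # (missing labels, orphan labels)

def check_label_format(path: Path) -> bool:
    """Each line: '<class> <cx> <cy> <w> <h>' with coordinates normalized to [0, 1].
    An empty file is valid (an image containing no humans)."""
    for line in path.read_text().splitlines():
        parts = line.split()
        if len(parts) != 5:
            return False
        cls, *coords = parts
        if cls != "0":  # single 'human' class assumed
            return False
        if not all(0.0 <= float(c) <= 1.0 for c in coords):
            return False
    return True

if __name__ == "__main__":
    missing, orphans = check_pairing()
    bad = [p.name for p in LBL_DIR.glob("*.txt") if not check_label_format(p)]
    print(f"missing labels: {len(missing)}, orphan labels: {len(orphans)}, malformed: {len(bad)}")
```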
4.2. Experimental Process
4.3. Evaluation Criteria
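The experiments are evaluated with the standard COCO-style metrics reported in the tables below: mAP50 (mean average precision at an IoU threshold of 0.5) and mAP50-95 (mAP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05), alongside parameter count and FLOPs. As a reminder of the box-matching criterion underlying both metrics, here is a minimal intersection-over-union computation; the (x1, y1, x2, y2) corner format is an assumption for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```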
4.4. Experimental Results and Analysis
4.4.1. Ablation Experiment
4.4.2. Comparison Experiment
4.5. Visual Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Training hyperparameters used in the experiments:

Parameter | Value |
---|---|
Initial learning rate | 0.02 |
Epochs | 100 |
Batch size | 20 |
Image size (imgsz) | 640 |
Optimizer | SGD |
Weight decay | 0.0005 |
Momentum | 0.937 |
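The settings above map directly onto the Ultralytics Python API. The sketch below shows how the YOLOv8s baseline could be trained with them; `hcie.yaml` is a hypothetical dataset configuration for HCIE that this extract does not provide.

```python
from ultralytics import YOLO

# Baseline architecture; CIHD-YOLO would substitute a modified model definition.
model = YOLO("yolov8s.yaml")

model.train(
    data="hcie.yaml",       # hypothetical YAML pointing at HCIE train/val splits
    epochs=100,
    imgsz=640,
    batch=20,
    optimizer="SGD",
    lr0=0.02,               # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```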
Ablation study of the proposed modules on the HCIE dataset (√ = module included, × = module excluded):

Models | SPPEPSA | GSEAM | GSCConv | mAP50 (%) | mAP50-95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|
YOLOv8s | × | × | × | 85.77 | 61.52 | 11.17 | 28.8 |
YOLOv8s_1 | √ | × | × | 86.15 | 62.57 | 12.16 | 29.6 |
YOLOv8s_2 | × | √ | × | 86.23 | 63.05 | 10.17 | 26.9 |
YOLOv8s_3 | × | × | √ | 86.14 | 63.07 | 11.25 | 28.9 |
YOLOv8s_4 | √ | √ | × | 86.42 | 63.38 | 11.16 | 27.7 |
YOLOv8s_5 | √ | × | √ | 86.39 | 63.50 | 12.24 | 29.7 |
YOLOv8s_6 | × | √ | √ | 86.52 | 63.77 | 10.25 | 27.0 |
CIHD-YOLO | √ | √ | √ | 86.76 | 64.19 | 11.24 | 27.8 |
Comparison with mainstream detectors on the HCIE dataset:

Models | mAP50 (%) | mAP50-95 (%) | Model Size (MB) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|
YOLOv5s | 82.23 | 55.89 | 14.4 | 7.01 | 15.8 |
YOLOv5sp6 | 84.59 | 59.06 | 25.1 | 12.32 | 16.3 |
YOLOv6s | 86.26 | 63.15 | 32.8 | 16.30 | 44.0 |
YOLOv8s | 85.77 | 61.52 | 22.5 | 11.17 | 28.8 |
YOLOv9s | 84.27 | 60.89 | 15.2 | 7.29 | 27.4 |
YOLOv10s | 84.10 | 61.46 | 16.5 | 8.04 | 24.4 |
CenterNet | 72.20 | 44.00 | 124.9 | 32.67 | 70.2 |
EfficientDet | 59.90 | 38.40 | 15.1 | 3.87 | 5.2 |
RT-DETR-L | 82.65 | 59.26 | 66.2 | 32.81 | 108.0 |
CIHD-YOLO | 86.76 | 64.19 | 22.8 | 11.24 | 27.8 |