YOLO-I3D: Optimizing Inflated 3D Models for Real-Time Human Activity Recognition
Abstract
1. Introduction
Contributions
- We introduce the object detection model You Only Look Once v5 (YOLOv5) [14] into the RGB branch of the original I3D [7] to construct YOLO-I3D, as shown in Figure 1, improving both the efficiency and accuracy of video processing. This upgrade improves the accuracy of the RGB-only single-branch I3D model by 1.42% on the Kinetics-400 dataset.
- To reduce the computational cost of the optical flow branch, we replace it with an RGB video processing pipeline to extract simpler motion information. This model, referred to as two-stream I3D Light (Figure 1), improves the original RGB-only I3D accuracy by 4.13% on the Kinetics-400 dataset.
- Finally, we combine these two enhancements—replacing the RGB branch with YOLOv5 and the optical flow branch with a light motion information processing branch—to create YOLO-I3D Light. The final model, shown in Figure 1, further improves the accuracy of YOLO-I3D by 0.41% on the Kinetics-400 dataset.
2. Related Work
2.1. Background: Video and Image Feature Extraction
2.1.1. Image Classification
2.1.2. Object Detection
2.2. Video-Based HAR
2.3. Datasets for HAR
3. Methodology
3.1. Stage 1: Two-Stream I3D Light
- To reduce the training time, we use the pretrained weights of the original model [7,49], and we use 32 RGB frames in the original RGB I3D224 branch instead of the 64 frames used in the published work (following other models such as T-C3D [9], F-E3D [50], and D3D [10]). We denote by I3D224 the original I3D RGB branch, which takes 32 frames at a resolution of 224 × 224 as input (224 × 224 × 32 instead of 224 × 224 × 64) to extract spatial features. While reducing the input resolution can lead to information loss, we mitigate this by leveraging the pretrained weights of the original I3D model. These weights retain robust spatial feature extraction capabilities learned at the higher resolution, ensuring that essential spatial features are preserved even at the lower resolution. Additionally, we balance spatial and temporal features by combining the lower-resolution motion information branch (112 × 112, 128 frames) with the higher-resolution RGB stream (224 × 224, 32 frames), which allows the model to capture both spatial detail and temporal dynamics.
- We implement a new I3D112 branch to replace the original OF branch in the two-stream I3D model [7]. We denote by I3D112 the motion information branch, which takes 128 frames at a resolution of 112 × 112 as input to capture temporal information.
- As shown in Figure 1b, we combine the top 224 × 224 × 32 RGB I3D224 branch with the bottom 112 × 112 × 128 I3D112 branch to create our two-stream I3D Light model. With I3D112, the proposed I3D Light achieves a balance between maintaining accuracy and reducing computational costs, setting the stage for the subsequent enhancements detailed in the following sections.
- To compare performance, we train the RGB branch of the original two-stream I3D model and our two-stream I3D Light model on the Kinetics-400 dataset. To reuse the pretrained original I3D weights [7,49] and reduce the training time, we change the pooling size of the last average pooling layer from 7 × 7 to 4 × 4, so that the output feature map of that layer matches the input size of the next layer in the I3D model. A minimal sketch of the two-stream forward pass is given after this list.
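The snippet below is a minimal PyTorch sketch of the two-stream I3D Light forward pass. The `TinyI3DStub` modules are hypothetical stand-ins for the full pretrained I3D branches [7,49], and averaging the two branches' logits is assumed as the late-fusion step (as in the original two-stream I3D); only the input shapes of the two streams are taken from the paper.

```python
import torch
import torch.nn as nn

class TinyI3DStub(nn.Module):
    """Stand-in for one I3D branch: a single 3D convolution, global average
    pooling, and a linear classifier. Only the input/output shapes matter here;
    in the real model each branch is a full I3D backbone loaded from the
    pretrained weights [7,49]."""
    def __init__(self, num_classes: int = 400):
        super().__init__()
        self.features = nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1)
        # Adaptive pooling keeps this stub resolution-agnostic; the paper instead
        # changes the fixed average pool of the 112x112 branch from 7x7 to 4x4.
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W)
        feats = self.pool(self.features(x)).flatten(1)
        return self.fc(feats)                   # (B, num_classes)

class TwoStreamI3DLight(nn.Module):
    def __init__(self, num_classes: int = 400):
        super().__init__()
        self.rgb_branch = TinyI3DStub(num_classes)     # I3D224: 224x224, 32 frames
        self.motion_branch = TinyI3DStub(num_classes)  # I3D112: 112x112, 128 frames

    def forward(self, rgb_clip, motion_clip):
        # rgb_clip:    (B, 3, 32, 224, 224)  -- spatial detail
        # motion_clip: (B, 3, 128, 112, 112) -- temporal dynamics
        logits_rgb = self.rgb_branch(rgb_clip)
        logits_motion = self.motion_branch(motion_clip)
        return (logits_rgb + logits_motion) / 2        # simple late fusion

model = TwoStreamI3DLight()
rgb = torch.randn(2, 3, 32, 224, 224)
motion = torch.randn(2, 3, 128, 112, 112)
print(model(rgb, motion).shape)  # torch.Size([2, 400])
```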
3.2. Stage 2: YOLO-I3D Model
- To incorporate the YOLOv5 model into I3D, we divide the data processing pipeline of each model into a top and a bottom part. We keep the top part of I3D and replace its bottom part with the bottom part of the YOLO model.
- To make the data sizes compatible, we use an input image resolution of 224 × 224, for which the YOLO bottom part produces a 512 × 14 × 14 feature map per frame (512 channels with a 14 × 14 spatial resolution). To simulate the stride of 2 in the first layer of the 3D CNN module of the I3D model, we downsample the frames by a factor of 2 in the data feeding pipeline of YOLO-I3D. During training, the input to YOLO-I3D is a batch of videos (batch size denoted by B in Figure 2). Each video is sampled to 32 frames, so the stacked feature maps from the YOLO part of YOLO-I3D have shape B × 16 × 512 × 14 × 14. To simulate the pooling layer in the bottom part of the I3D model, we pool the output of the YOLO part, reducing the shape to B × 8 × 512 × 14 × 14. Finally, to match the input shape expected by the top part of the I3D model, we swap the frame dimension (8) and the channel dimension (512), giving a final shape of B × 512 × 8 × 14 × 14. These reshaping steps are sketched in the code example after this list.
- During training, the weights of the YOLO part of YOLO-I3D are frozen, and only the weights in the top I3D part of YOLO-I3D are optimized. To compare performance, we implement and train the original I3D224 model and our YOLO-I3D model on the large Kinetics-400 dataset.
- We fine-tune the original I3D224 model and our YOLO-I3D model on the smaller HMDB51 dataset to evaluate the performance of YOLO-I3D under transfer learning. During training, 32 frames at a resolution of 224 × 224 are used as input.
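Below is a sketch of the data-shaping steps between the frozen YOLOv5 bottom part and the trainable I3D top part described above. The single `Conv2d` is a hypothetical stand-in for the truncated YOLOv5 backbone, and taking every second frame and averaging adjacent frame pairs are assumed implementations of the downsampling and pooling steps; only the tensor shapes follow the description in the paper.

```python
import torch
import torch.nn as nn

B = 2                                       # batch of videos
frames = torch.randn(B, 32, 3, 224, 224)    # 32 sampled frames per video

# Hypothetical stand-in for the YOLOv5 bottom part: maps one 3x224x224 frame
# to a 512x14x14 feature map, exactly like the truncated backbone would.
yolo_bottom = nn.Sequential(nn.Conv2d(3, 512, kernel_size=16, stride=16))
for p in yolo_bottom.parameters():
    p.requires_grad = False                 # the YOLO part is not trained

# 1) Downsample by 2 in time to mimic the stride-2 first layer of I3D: 32 -> 16.
frames = frames[:, ::2]                               # (B, 16, 3, 224, 224)

# 2) Run the YOLO bottom on every frame and stack the per-frame feature maps.
feats = yolo_bottom(frames.flatten(0, 1))             # (B*16, 512, 14, 14)
feats = feats.view(B, 16, 512, 14, 14)                # (B, 16, 512, 14, 14)

# 3) Temporal pooling to mimic the pooling in the I3D bottom part: 16 -> 8
#    (here: average over adjacent frame pairs).
feats = feats.view(B, 8, 2, 512, 14, 14).mean(dim=2)  # (B, 8, 512, 14, 14)

# 4) Swap the frame and channel dimensions to match the I3D top part's input.
feats = feats.permute(0, 2, 1, 3, 4).contiguous()     # (B, 512, 8, 14, 14)
print(feats.shape)  # torch.Size([2, 512, 8, 14, 14])
# `i3d_top(feats)` would then produce the class logits; only its weights are trained.
```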
3.3. Stage 3: Two-Stream YOLO-I3D Light
- In this section, we combine the two enhancements to further improve the two-stream I3D Light model. Figure 1c shows the combined YOLO-I3D Light model. The YOLO-I3D architecture (explained in Section 3.2) is applied to the I3D112 branch of the two-stream I3D Light model (explained in Section 3.1), resulting in the new two-stream YOLO-I3D Light model containing the YOLO-I3D112 branch, as shown in Figure 1d.
- The overall performance of the model is validated using the Kinetics-400 dataset against the original I3D model.
3.4. Model Training and Validation
4. Implementation
4.1. Experimental Environment
4.2. Data Preprocessing
4.2.1. Preprocessing Kinetics-400
4.2.2. Preprocessing HMDB51
4.3. Model Training Using Sub-Epoch and Sub-Dataset
- Algorithm: Sub-epoch runner
- Step 1: Loading and preprocessing data.
  - Load the training dataset and the validation dataset.
  - Randomly divide the whole training dataset into N (e.g., 10) subsets with equal probability. Keep the validation dataset undivided.
  - Create N sub-data-loaders corresponding to the N sub-datasets for training and one data loader for the validation dataset.
- Step 2: Training.
  - Divide each epoch into N sub-epochs. Each sub-epoch performs training on the corresponding sub-dataset followed by validation on the whole validation dataset.
  - If the validation accuracy does not improve after a specified number of training epochs, the learning rate is decreased according to a given schedule, and the model weights are reset to the values that achieved the best validation accuracy up to that point.
- Step 3: Stop the training if the epoch number reaches a specified value or the early-stopping condition is satisfied. Otherwise, go back to Step 2.
- We divide each epoch into N sub-epochs in all the experiments in this section. The learning rate is decreased in the sub-epoch runner when there is no improvement in the validation accuracy after N sub-epochs (i.e., one entire epoch). A minimal sketch of the sub-epoch runner is given below.
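The following sketch illustrates the sub-epoch runner in PyTorch. The `train_one_sub_epoch` and `evaluate` helpers are ordinary training and evaluation loops written here for completeness, and the batch size, learning-rate decay factor, epoch limit, and early-stopping threshold are illustrative values rather than the paper's exact schedule.

```python
import copy
import torch
from torch.utils.data import DataLoader, random_split

def train_one_sub_epoch(model, loader, optimizer, criterion, device):
    model.train()
    for clips, labels in loader:
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = total = 0
    for clips, labels in loader:
        clips, labels = clips.to(device), labels.to(device)
        correct += (model(clips).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

def sub_epoch_runner(model, train_set, val_loader, optimizer, criterion,
                     device, n_subsets=10, max_epochs=50, lr_decay=0.1):
    # Step 1: split the training set into N random subsets of (almost) equal
    # size and build one data loader per subset; the validation set stays whole.
    lengths = [len(train_set) // n_subsets] * n_subsets
    lengths[-1] += len(train_set) - sum(lengths)
    sub_loaders = [DataLoader(subset, batch_size=16, shuffle=True)
                   for subset in random_split(train_set, lengths)]

    best_acc, best_state, stale = 0.0, copy.deepcopy(model.state_dict()), 0
    for epoch in range(max_epochs):
        # Step 2: each epoch = N sub-epochs; train on one subset,
        # then validate on the whole validation set.
        for loader in sub_loaders:
            train_one_sub_epoch(model, loader, optimizer, criterion, device)
            acc = evaluate(model, val_loader, device)
            if acc > best_acc:
                best_acc, best_state, stale = acc, copy.deepcopy(model.state_dict()), 0
            else:
                stale += 1
            # No improvement for N sub-epochs (one whole epoch): decay the
            # learning rate and roll back to the best weights seen so far.
            if stale >= n_subsets:
                model.load_state_dict(best_state)
                for group in optimizer.param_groups:
                    group["lr"] *= lr_decay
                stale = 0
        # Step 3: stop early once the learning rate has decayed to a negligible value.
        if optimizer.param_groups[0]["lr"] < 1e-6:
            break
    model.load_state_dict(best_state)
    return model, best_acc
```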
4.4. Model Training
5. Validation
5.1. Experiment 1: Two-Stream I3D Light
5.1.1. Model Accuracy
5.1.2. Effectiveness of Sub-Epoch Training
5.1.3. Computational Cost
5.1.4. Discussion
5.2. Experiment 2: YOLO-I3D
5.2.1. Model Performance
5.2.2. Discussion
5.3. Experiment 3: Two-Stream YOLO-I3D Light
5.3.1. Model Accuracy
5.3.2. Execution Time
5.3.3. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Adel, B.; Badran, A.; Elshami, N.E.; Salah, A.; Fathalla, A.; Bekhit, M. A Survey on Deep Learning Architectures in Human Activities Recognition Application in Sports Science, Healthcare, and Security. In Proceedings of the International Conference on Innovations in Computing Research, Athens, Greece, 29–31 August 2022; Springer International Publishing: New York, NY, USA, 2022; pp. 121–134. [Google Scholar]
- Gómez, S.; Mejía, D.; Tobón, S. The deterrent effect of surveillance cameras on crime. J. Policy Anal. Manag. 2021, 40, 553–571. [Google Scholar] [CrossRef]
- Mu, X.; Zhang, X.; Osivue, O.R.; Han, H.; khaled Kadry, H.; Wang, Y. Dynamic modeling and control method of walking posture of multifunctional elderly-assistant and walking-assistant robot for preventing elderly fall. In Proceedings of the 2018 International Conference on Sensing, Diagnostics, Prognostics, and Control (SDPC), Xi’an, China, 15–17 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 806–809. [Google Scholar]
- Tarek, O.; Magdy, O.; Atia, A. Yoga Trainer for Beginners Via Machine Learning. In Proceedings of the 2021 9th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), Virtual, 13–14 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 75–78. [Google Scholar]
- Dang, L.M.; Min, K.; Wang, H.; Piran, M.J.; Lee, C.H.; Moon, H. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognit. 2020, 108, 107561. [Google Scholar] [CrossRef]
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef] [PubMed]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
- Liu, K.; Liu, W.; Ma, H.; Tan, M.; Gan, C. A real-time action representation with temporal encoding and deep compression. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 647–660. [Google Scholar] [CrossRef]
- Jiang, S.; Qi, Y.; Zhang, H.; Bai, Z.; Lu, X.; Wang, P. D3d: Dual 3-d convolutional network for real-time action recognition. IEEE Trans. Ind. Inform. 2020, 17, 4584–4593. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Jocher, G.; Stoken, A.; Borovec, J.; NanoCode012; ChristopherSTAN. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 10 November 2022).
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Tao, H.; Duan, Q. An adaptive frame selection network with enhanced dilated convolution for video smoke recognition. Expert Syst. Appl. 2023, 215, 119371. [Google Scholar] [CrossRef]
- Tao, H. A label-relevance multi-direction interaction network with enhanced deformable convolution for forest smoke recognition. Expert Syst. Appl. 2024, 236, 121383. [Google Scholar] [CrossRef]
- Tao, H.; Duan, Q.; Lu, M.; Hu, Z. Learning discriminative feature representation with pixel-level supervision for forest smoke recognition. Pattern Recognit. 2023, 143, 109761. [Google Scholar] [CrossRef]
- Guo, X.; Zhang, X.; Li, L.; Xia, Z. Micro-expression spotting with multi-scale local transformer in long videos. Pattern Recognit. Lett. 2023, 168, 146–152. [Google Scholar] [CrossRef]
- Liu, P.; Wang, F.; Li, K.; Chen, G.; Wei, Y.; Tang, S.; Wu, Z.; Guo, D. Micro-gesture Online Recognition using Learnable Query Points. arXiv 2024, arXiv:2407.04490. [Google Scholar]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
- Deng, J.; Xuan, X.; Wang, W.; Li, Z.; Yao, H.; Wang, Z. A review of research on object detection based on deep learning. J. Phys. Conf. Ser. 2020, 1684, 012028. [Google Scholar] [CrossRef]
- Yang, G.; Feng, W.; Jin, J.; Lei, Q.; Li, X.; Gui, G.; Wang, W. Face mask recognition system with YOLOV5 based on image recognition. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1398–1404. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Beauchemin, S.S.; Barron, J.L. The computation of optical flow. ACM Comput. Surv. (CSUR) 1995, 27, 433–466. [Google Scholar] [CrossRef]
- Zhao, K.; Zhang, K.; Zhai, Y.; Wang, D.; Su, J. Real-time sign language recognition based on video stream. Int. J. Syst. Control Commun. 2021, 12, 158–174. [Google Scholar] [CrossRef]
- Chen, C.; Liu, M.; Liu, H.; Zhang, B.; Han, J.; Kehtarnavaz, N. Multi-temporal depth motion maps-based local binary patterns for 3-D human action recognition. IEEE Access 2017, 5, 22590–22604. [Google Scholar] [CrossRef]
- Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Image Analysis: 13th Scandinavian Conference, SCIA 2003, Halmstad, Sweden, 29 June–2 July 2003, Proceedings 13; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370. [Google Scholar]
- OpenCV Team. Optical Flow. Available online: https://docs.opencv.org/4.x/d4/dee/tutorial_optical_flow.html (accessed on 15 January 2023).
- Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
- Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer International Publishing: New York, NY, USA, 2020; pp. 402–419. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, Washington, DC, USA, 23–26 August 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 32–36. [Google Scholar]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
- Sigurdsson, G.A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; Gupta, A. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: New York, NY, USA, 2016; pp. 510–526. [Google Scholar]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar]
- Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; Zisserman, A. A short note about kinetics-600. arXiv 2018, arXiv:1808.01340. [Google Scholar]
- Carreira, J.; Noland, E.; Hillier, C.; Zisserman, A. A short note on the kinetics-700 human action dataset. arXiv 2019, arXiv:1907.06987. [Google Scholar]
- Miracleyoo. Re-Trainable I3D Models Transferred from TensorFlow to PyTorch. Available online: https://github.com/miracleyoo/Trainable-i3d-pytorch (accessed on 8 November 2022).
- Fan, H.; Luo, C.; Zeng, C.; Ferianc, M.; Que, Z.; Liu, S.; Niu, X.; Luk, W. F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition. In Proceedings of the 2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), New York, NY, USA, 15–17 July 2019; IEEE: Piscataway, NJ, USA, 2019; Volume 2160, pp. 1–8. [Google Scholar]
- PyTorch Foundation. PyTorch 2.0 Now Available. Available online: https://pytorch.org/ (accessed on 15 February 2023).
- DeepMind. Kinetics400 Dataset. Available online: https://academictorrents.com/details/184d11318372f70018cf9a72ef867e2fb9ce1d26 (accessed on 27 December 2022).
- DeepMind. I3D Models Trained on Kinetics. Available online: https://github.com/deepmind/kinetics-i3d (accessed on 28 December 2022).
- PyTorch Foundation. CrossEntropyLoss. Available online: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html (accessed on 28 December 2022).
- Zafar, I.; Tzanidou, G.; Burton, R.; Patel, N.; Araujo, L. Hands-on Convolutional Neural Networks with TensorFlow: Solve Computer Vision Problems with Modeling in TensorFlow and Python; Packt Publishing Ltd.: Birmingham, UK, 2018. [Google Scholar]
- Yep, T. Torchinfo. Available online: https://github.com/TylerYep/torchinfo (accessed on 10 December 2022).
- Xu, J.; Song, R.; Wei, H.; Guo, J.; Zhou, Y.; Huang, X. A fast human action recognition network based on spatio-temporal features. Neurocomputing 2021, 441, 350–358. [Google Scholar] [CrossRef]
- Luo, R.; Rivest, F. I3D Light: A Simple Motion Information Stream for I3D. In Canadian AI; Toronto, ON, Canada, 2023. [Google Scholar]
Dataset | Year | Frame Rate | Number of Actions | Number of Videos | Average Clip Length | Video Resolution |
---|---|---|---|---|---|---|
KTH [42] | 2004 | 25 FPS | 6 | 2391 | 4 s | 160 × 120 |
HMDB51 [12] | 2011 | 30 FPS | 51 | 6849 | 2.5 s | 340 × 256 * |
UCF101 [13] | 2012 | 25 FPS | 101 | 13,320 | 7.2 s | 320 × 240 |
Sports-1M [43] | 2014 | variable | 487 | 1,000,000 | 5 m 36 s | variable |
Charades [44] | 2015 | 24 FPS | 157 | 9848 | 30 s | 640 × 480 |
NTU RGB+D [45] | 2016 | 30 FPS | 60 | 56,880 | 8.6 s | 512 × 424 |
miniKinetics [7] | 2017 | variable | 213 | 120,000 | 10 s | variable |
Kinetics-400 [11] | 2017 | variable | 400 | 306,245 | 10 s | variable |
Something-Something V1 [46] | 2017 | 24 FPS | 174 | 108,499 | 4.0 s | 84 × 84 |
Something-Something V2 [46] | 2018 | 24 FPS | 174 | 220,847 | 4.0 s | 84 × 84 |
Kinetics-600 [47] | 2018 | variable | 600 | 495,547 | 10 s | variable |
Kinetics-700 [48] | 2019 | variable | 700 | 650,317 | 10 s | variable |
NTU RGB+D 120 [41] | 2019 | 30 FPS | 120 | 114,480 | 8.5 s | 512 × 424 |
Model | Input | Training Accuracy | Validation Accuracy |
---|---|---|---|
I3D224 | 32 frames | 69.84% | 61.00% |
I3D112 | 128 frames | 65.89% | 60.02% |
I3D224+I3D112 (No combined tuning) | 32 frames + 128 frames | 80.15% | 64.47% |
I3D224+I3D112 (With combined tuning) | 32 frames + 128 frames | 75.51% | 65.09% |
Model | Input | Training Accuracy | Validation Accuracy |
---|---|---|---|
I3D224 only | 224 × 224 × 32 | 69.65% | 61.03% |
I3D112 only | 112 × 112 × 128 | 66.91% | 60.39% |
Two-stream I3D224 + I3D112 (With combined tuning) | 224 × 224 × 32 + 112 × 112 × 128 | 76.20% | 65.16% |
Original I3D RGB branch only on miniKinetics * [7] | 224 × 224 × 64 | N/A | 74.1% (test accuracy) |
Original two-stream I3D on miniKinetics * RGB branch+OF branch [8] | 224 × 224 × 64 + 224 × 224 × 64 | N/A | 78.7% (test accuracy) |
Model | Estimated Total Memory Size | Total Mult-Adds | Avg. Execution Time per Video | Training Time per Epoch | Validation Time per Epoch |
---|---|---|---|---|---|
I3D112 only | 12,490.13 MB | 944.80 G | 0.028 s | 3 h 24 m 3 s | 7 m 22 s |
I3D224 only | 12,528.13 MB | 949.73 G | 0.028 s | 3 h 48 m 43 s | 13 m 10 s |
Two-stream I3D224 + I3D112 [8] | 25,018.26 MB | 1894.53 G | 0.056 s | N/A | N/A |
OF (Farneback) only [36] | N/A | N/A | 0.638 s | N/A | N/A |
Two-stream I3D224 + OF (Farneback) [36] | N/A | N/A | 0.666 s | N/A | N/A |
OF (PWC-Net) only [38] | N/A | N/A | 0.574 s | N/A | N/A |
Two-stream I3D224 + OF (PWC-Net) [38] | N/A | N/A | 0.600 s | N/A | N/A |
OF (RAFT) only [39] | N/A | N/A | 0.390 s | N/A | N/A |
Two-stream I3D224 + OF (RAFT) [39] | N/A | N/A | 0.418 s | N/A | N/A |
Model | Input | Training Accuracy | Validation Accuracy | Training Time per Epoch | Validation Time per Epoch |
---|---|---|---|---|---|
YOLO-I3D | 32 frames | 69.05% | 62.42% | 1 h 16 m 37 s | 5 m 02 s |
I3D224 | 32 frames | 69.84% | 61.00% | 3 h 24 m 03 s | 7 m 22 s |
Model | Input | Training Accuracy | Validation Accuracy |
---|---|---|---|
Xu et al. [57] (Baseline) | - | - | 67.9%
YOLO-I3D | 32 frames | 84.64% | 70.98% |
I3D224 | 32 frames | 79.71% | 70.98% |
Model | Input | Training Accuracy | Validation Accuracy |
---|---|---|---|
YOLO-I3D112 | 128 frames | 71.52% | 61.46% |
I3D112 | 128 frames | 65.89% | 60.02% |
Two-stream YOLO-I3D Light I3D224 + YOLO-I3D112 | 32 frames + 128 frames | N/A | 65.57% |
Two-stream I3D Light I3D224 + I3D112 | 32 frames + 128 frames | 76.20% | 65.16% |
Model | Average Execution Time per Video | Training Time per Epoch | Validation Time per Epoch |
---|---|---|---|
YOLO-I3D112 | 0.024 s | 1 h 26 m 29 s | 6 m 30 s |
I3D112 | 0.028 s | 3 h 48 m 43 s | 13 m 10 s |
YOLO-I3D | 0.024 s | 1 h 16 m 37 s | 5 m 02 s |
I3D224 | 0.028 s | 3 h 24 m 03 s | 7 m 22 s |
Two-stream I3D Light | 0.056 s | 7 h 22 m 27 s | 16 m 14 s |
Two-stream YOLO-I3D Light | 0.052 s | 5 h 19 m 39 s | 13 m 34 s |
Optical flow | 0.638 s | N/A | N/A |
Original two-stream I3D | 0.666 s | N/A | N/A |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).