Hybrid Neural Network Models to Estimate Vital Signs from Facial Videos
Abstract
1. Introduction
1.1. Background Introduction
1.2. Related Work
2. Dataset Preprocessing
2.1. Image Normalization and Face Detection
2.2. Facial Image Standardization
3. Deep Learning Neural Networks
3.1. Convolutional Neural Network
3.2. Video Vision Transformer
3.3. Recurrent Neural Network
3.4. Hybrid Models
3.5. Nine Neural Network (NN) Models
4. Experimental Results
4.1. Vital Video Dataset Description
- A total of 891 participants (balanced gender/age distribution including all skin tones);
- Uncompressed 1920 × 1200, 30 FPS facial videos (see Figure 1b);
- Synchronized PPG ground truth;
- Blood pressure spot measurement collected during video recording;
- Sex and age recorded.
The per-second ground-truth readings and subject metadata supplied with each video are as follows:
- i. Heart rate (HR): one reading per second, in beats per minute, e.g., 72;
- ii. Blood oxygen saturation level (SpO2): one reading per second, in percentage (≤100%), e.g., 98;
- iii. Blood pressure (BP): one reading per second, comprising systolic and diastolic pressure, in millimeters of mercury (mm Hg), e.g., 126/66;
- iv. Subject sex: e.g., M;
- v. Subject age: e.g., 22;
- vi. PPG waveform: 56 readings per second, in percentage (not used in this study), e.g., 35;
- vii. Subject skin color: labeled from 1 (white) to 6 (dark brown/black) according to the Fitzpatrick skin types (not used in this study), e.g., 4.
4.2. Nine NN Model Performance
4.3. Score Fusion Performance
4.4. Time Costs of Nine NN Models
5. Discussion and Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Bousefsaf, F.; Maaoui, C.; Pruski, A. Peripheral vasomotor activity assessment using a continuous wavelet analysis on webcam photoplethysmographic signals. Bio-Med. Mater. Eng. 2016, 27, 527–538. [Google Scholar] [CrossRef] [PubMed]
- Jeong, I.C.; Finkelstein, J. Introducing contactless blood pressure assessment using a high speed video camera. J. Med. Syst. 2016, 40, 77. [Google Scholar] [CrossRef] [PubMed]
- Shao, D.; Liu, C.; Tsow, F.; Yang, Y.; Du, Z.; Iriya, R.; Yu, H.; Tao, N. Noncontact monitoring of blood oxygen saturation using camera and dual-wavelength imaging system. IEEE Trans. Biomed. Eng. 2016, 63, 1091–1098. [Google Scholar] [CrossRef] [PubMed]
- Poh, M.Z.; McDuff, D.J.; Picard, R.W. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 2011, 58, 7–11. [Google Scholar] [CrossRef] [PubMed]
- Poh, M.Z.; McDuff, D.J.; Picard, R.W. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express 2010, 18, 10762–10774. [Google Scholar] [CrossRef]
- Chen, W.; McDuff, D. Deepphys: Video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 349–365. [Google Scholar]
- Hu, M.; Qian, F.; Wang, X.; He, L.; Guo, D.; Ren, F. Robust heart rate estimation with spatial–temporal attention network from facial videos. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 639–647. [Google Scholar] [CrossRef]
- Lokendra, B.; Puneet, G. And-rppg: A novel denoising-rppg network for improving remote heart rate estimation. Comput. Biol. Med. 2022, 141, 105146. [Google Scholar] [CrossRef] [PubMed]
- Yin, R.N.; Jia, R.S.; Cui, Z.; Sun, H.M. Pulsenet: A multitask learning network for remote heart rate estimation. Knowl.-Based Syst. 2022, 239, 108048. [Google Scholar] [CrossRef]
- Li, B.; Zhang, P.; Peng, J.; Fu, H. Non-contact ppg signal and heart rate estimation with multi-hierarchical convolutional network. Pattern Recogn. 2023, 139, 109421. [Google Scholar] [CrossRef]
- Luo, H.; Yang, D.; Barszczyk, A.; Vempala, N.; Wei, J.; Wu, S.J.; Zheng, P.P.; Fu, G.; Lee, K.; Feng, Z.P. Smartphone-based blood pressure measurement using transdermal optical imaging technology. Circ. Cardiovasc. Imaging 2019, 12, e008857. [Google Scholar] [CrossRef] [PubMed]
- Wu, B.F.; Chiu, L.W.; Wu, Y.C.; Lai, C.C.; Chu, P.H. Contactless blood pressure measurement via remote photoplethysmography with synthetic data generation using generative adversarial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2130–2138. [Google Scholar]
- Iuchi, K.; Miyazaki, R.; Cardoso, G.C.; Ogawa-Ochiai, K.; Tsumura, N. Remote estimation of continuous blood pressure by a convolutional neural network trained on spatial patterns of facial pulse waves. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2139–2145. [Google Scholar]
- Song, R.; Chen, H.; Cheng, J.; Li, C.; Liu, Y.; Chen, X. PulseGAN: Learning to Generate Realistic Pulse Waveforms in Remote Photoplethysmography. IEEE J. Biomed. Health Inform. 2021, 25, 1373–1384. [Google Scholar] [CrossRef]
- de Haan, G.; Jeanne, V. Robust Pulse Rate From Chrominance-Based rPPG. IEEE Trans. Biomed. Eng. 2013, 60, 2878–2886. [Google Scholar] [CrossRef]
- Yu, Z.; Peng, W.; Li, X.; Hong, X.; Zhao, G. Remote Heart Rate Measurement From Highly Compressed Facial Videos: An End-to-End Deep Learning Solution With Video Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Yu, Z.; Shen, Y.; Shi, J.; Zhao, H.; Torr, P.H.S.; Zhao, G. PhysFormer: Facial Video-Based Physiological Measurement with Temporal Difference Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Hu, M.; Qian, F.; Guo, D.; Wang, X.; He, L.; Ren, F. ETArPPGNet: Effective Time-Domain Attention Network for Remote Heart Rate Measurement. IEEE Trans. Instrum. Meas. 2021, 70, 2506212. [Google Scholar] [CrossRef]
- Gideon, J.; Stent, S. The Way to my Heart is through Contrastive Learning: Remote Photoplethysmography from Unlabelled Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Hsu, G.-S.J.; Xie, R.-C.; Ambikapathi, A.; Chou, K.-J. A deep learning framework for heart rate estimation from facial videos. Neurocomputing 2020, 417, 155–166. [Google Scholar] [CrossRef]
- Omer, O.A.; Salah, M.; Hassan, L.; Abdelreheem, A.; Hassan, A.M. Video-based beat-by-beat blood pressure monitoring via transfer deep-learning. Appl. Intell. 2024, 54, 4564–4584. [Google Scholar] [CrossRef]
- Cheng, C.-H.; Yuen, Z.; Chen, S.; Wong, K.-L.; Chin, J.-W.; Chan, T.-T.; So, R.H.Y. Contactless Blood Oxygen Saturation Estimation from Facial Videos Using Deep Learning. Bioengineering 2024, 11, 251. [Google Scholar] [CrossRef]
- Jaiswal, K.B.; Meenpal, T. Heart rate estimation network from facial videos using spatiotemporal feature image. Comput. Biol. Med. 2022, 151, 106307. [Google Scholar] [CrossRef] [PubMed]
- Lin, B.; Tao, J.; Xu, J.; He, L.; Liu, N.; Zhang, X. Estimation of vital signs from facial videos via video magnification and deep learning. iScience 2023, 26, 107845. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Jinsoo, P.; Kwangseok, H. Facial Video-Based Robust Measurement of Respiratory Rates in Various Environmental Conditions. J. Sens. 2023, 2023, 9207750. [Google Scholar] [CrossRef]
- Zheng, Y.; Wang, H.; Hao, Y. Mobile application for monitoring body temperature from facial images using convolutional neural network and support vector machine. In Mobile Multimedia/Image Processing, Security, and Applications 2020; Proceedings SPIE 11399; SPIE: Bellingham, WA, USA, 2020; p. 113990B. [Google Scholar] [CrossRef]
- Toye, P.J. Vital Videos: A dataset of face videos with PPG and blood pressure ground truths. arXiv 2023, arXiv:2306.11891. [Google Scholar]
- Yang, M.H.; Kriegman, D.J.; Ahuja, N. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 34–58. [Google Scholar] [CrossRef]
- Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154. [Google Scholar] [CrossRef]
- Viola, P.; Jones, M. Rapid Object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 511–518. [Google Scholar]
- Papageorgiou, C.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 7 January 1998; pp. 555–562. [Google Scholar]
- Freund, Y.; Schapire, R. A decision theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Computational Learning Theory: Eurocolt’95, Barcelona, Spain, 13–15 March 1995; pp. 23–37. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 1, pp. 1097–1105. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- University of Oxford, Visual Geometry Group. Available online: http://www.robots.ox.ac.uk/~vgg/research/very_deep/ (accessed on 1 November 2024).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805v2. [Google Scholar]
- Wensel, J.; Ullah, H.; Munir, A. ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos. arXiv 2022, arXiv:2208.07929. [Google Scholar]
- Kuncheva, L.I. A Theoretical Study on Six Classifier Fusion Strategies. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 281–286. [Google Scholar] [CrossRef]
- Prabhakar, S.; Jain, A.K. Decision-level Fusion in Fingerprint Verification. Pattern Recognit. 2002, 35, 861–874. [Google Scholar] [CrossRef]
- Ulery, B.; Hicklin, A.R.; Watson, C.; Fellner, W.; Hallinan, P. Studies of Biometric Fusion; NIST Interagency Report; US Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, USA, 2006. [Google Scholar]
- Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
- Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
Ref.\Study | Method | Vital Sign | Dataset (Size) | Result (RMSE) |
---|---|---|---|---|
[22] | CNN | SpO2 | VIPL-HR (107 subjects) | 1.71% |
[23] | CNN (ResNet-18) | HR | VIPL-HR (107 subjects) | 7.21 |
[24] | video magnification, CNN | HR, SBP, DBP | Private (288 videos) | 5.33, 4.11, 3.75 |
[25] | FFT | RR | Private (unknown size) | 1.01 |
[26] | CNN-SVM | Body temperature | Private (144 subjects) | 0.35 |
(This Study) | Hybrid CNN model | HR, SpO2, SBP, DBP | VVD-Large (891 subjects) [27] | 5.55, 0.87, 4.07, 2.21 |
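The results in the table above are reported as root-mean-square error (RMSE). A minimal sketch of that metric, assuming the conventional definition (the cited works' exact evaluation protocols may differ), is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between ground-truth and predicted vital-sign values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Example: heart-rate predictions vs. per-second ground truth (bpm)
print(rmse([72, 75, 70], [70, 78, 69]))  # ≈ 2.16
```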
HMdl_ab: ViViT ⊕ ConvLSTM, 1072.0 M Paras

| Layer (Type) | Output Shape | Layer (Type) | Output Shape |
| --- | --- | --- | --- |
| Frame_Input | (30, 288, 256, 3) | | |
| Stem: time_distr (Conv2D): (filters, kernel_size, strides) = (64, (4, 4), (4, 4)) | (30, 72, 64, 64) | | |
| ViViT (Mdl_a): Tubelet_Embedding: Conv3D → 30 × 12 × 16 patches; Conv3D → 30 × 3 × 4 patches | (360, 768) | ConvLSTM (Mdl_b): (filters, kernel_size, strides) = (128, (3, 2), (3, 2)) | (30, 24, 32, 128) |
| Add (Positional_Encoder) | (360, 768) | (256, (2, 2), (2, 2)) | (30, 12, 16, 256) |
| 8 × Attn_FF: MultiHeadAttention (heads = 8, key_dim = 64); Feed_Forward_Net (768 → 3072 → 768) | | (512, (2, 2), (2, 2)) | (30, 6, 8, 512) |
| Add (Attn_FF_Norm) | (360, 768) | (768, (2, 2), (2, 2)) | (30, 3, 4, 768) |
| Flatten | (276,480) | Flatten | (276,480) |
| Concat | (552,960) | | |
| 6 × Head: | | | |
| Dense | (256) | | |
| Dense | (1) | | |
| Concat (Output) | (6) | | |
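The table above can be read as a functional model with one shared convolutional stem feeding two parallel branches whose flattened features are concatenated and passed to six per-sign regression heads. The Keras sketch below reproduces that layout under stated assumptions: a single-stage Conv3D tubelet embedding, ReLU activations, and post-norm transformer blocks are choices made here for brevity, not details confirmed by the paper.

```python
from tensorflow.keras import layers, Model

class AddPositionalEmbedding(layers.Layer):
    """Learnable positional encoding added to a (tokens, dim) sequence."""
    def build(self, input_shape):
        self.pos = self.add_weight(name="pos", shape=(1, input_shape[1], input_shape[2]),
                                   initializer="zeros", trainable=True)
    def call(self, x):
        return x + self.pos

frames = layers.Input(shape=(30, 288, 256, 3), name="Frame_Input")
# Shared stem: per-frame 4x4 strided convolution -> (30, 72, 64, 64)
stem = layers.TimeDistributed(layers.Conv2D(64, (4, 4), strides=(4, 4)))(frames)

# ViViT branch (Mdl_a): Conv3D tubelet embedding -> 30 x 3 x 4 = 360 tokens of width 768
tok = layers.Conv3D(768, kernel_size=(1, 24, 16), strides=(1, 24, 16))(stem)   # (30, 3, 4, 768)
x = layers.Reshape((360, 768))(tok)
x = AddPositionalEmbedding()(x)                         # Add (Positional_Encoder)
for _ in range(8):                                      # 8 x Attn_FF blocks
    attn = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(3072, activation="relu")(x)
    ff = layers.Dense(768)(ff)
    x = layers.LayerNormalization()(x + ff)
vivit_feat = layers.Flatten()(x)                        # (276,480)

# ConvLSTM branch (Mdl_b): four strided ConvLSTM2D layers ending at (30, 3, 4, 768)
y = stem
for filters, k in [(128, (3, 2)), (256, (2, 2)), (512, (2, 2)), (768, (2, 2))]:
    y = layers.ConvLSTM2D(filters, kernel_size=k, strides=k, return_sequences=True)(y)
convlstm_feat = layers.Flatten()(y)                     # (276,480)

# Concatenate branch features, then one small regression head per vital sign
feats = layers.Concatenate()([vivit_feat, convlstm_feat])   # (552,960)
heads = [layers.Dense(1)(layers.Dense(256, activation="relu")(feats)) for _ in range(6)]
hmdl_ab = Model(frames, layers.Concatenate(name="Output")(heads))  # (6): HR, SpO2, SBP, DBP, sex, age
```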
HMdl_abc: ViViT ⊕ ConvLSTM ⊕ ResCNN, 1569.2 M Paras

| Layer (Type) | Output Shape | Layer (Type) | Output Shape | Layer (Type) | Output Shape |
| --- | --- | --- | --- | --- | --- |
| Frame_Input | (30, 288, 256, 3) | | | | |
| Stem: time_distr (Conv2D): (filters, kernel_size, strides) = (64, (4, 4), (4, 4)) | (30, 72, 64, 64) | | | | |
| ViViT (Mdl_a): (from Hybrid_Model_1) … | | ConvLSTM (Mdl_b): (from Hybrid_Model_1) … | | ResCNN (Mdl_c): time_distr (Conv2D): (filters, kernel_size, strides) = (64, (7, 7), (1, 1)) | (30, 72, 64, 64) |
| | | | | (128, (5, 5), (3, 2)) | (30, 24, 32, 128) |
| | | | | (256, (3, 3), (2, 2)) | (30, 12, 16, 256) |
| | | | | (512, (3, 3), (2, 2)) | (30, 6, 8, 512) |
| | | | | (768, (3, 3), (2, 2)) | (30, 3, 4, 768) |
| Flatten | (276,480) | Flatten | (276,480) | Flatten | (276,480) |
| Concat (ViViT, ConvLSTM, ResCNN) | (829,440) | | | | |
| 6 × Head: | | | | | |
| Dense | (256) | | | | |
| Dense | (1) | | | | |
| Concat (Output) | (6) | | | | |
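HMdl_abc adds a third, ResCNN branch on top of the HMdl_ab layout. The sketch below, which assumes 'same' padding so the output shapes match the table and omits the residual (skip) connections implied by the branch's name, shows how that branch could be built from the shared stem; its flattened features would be concatenated with the ViViT and ConvLSTM features to form the 829,440-dimensional vector feeding the six heads.

```python
from tensorflow.keras import layers

def rescnn_branch(stem):
    """ResCNN branch (Mdl_c) sketch; `stem` is the shared (30, 72, 64, 64) stem output."""
    z = stem
    for filters, k, s in [(64, (7, 7), (1, 1)), (128, (5, 5), (3, 2)),
                          (256, (3, 3), (2, 2)), (512, (3, 3), (2, 2)),
                          (768, (3, 3), (2, 2))]:
        z = layers.TimeDistributed(
            layers.Conv2D(filters, k, strides=s, padding="same", activation="relu"))(z)
    return layers.Flatten()(z)   # ends at (30, 3, 4, 768), flattened to (276,480)
```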
| Model (# Paras, # Features)\Vital Sign | HR | SpO2 | SBP | DBP | Sex | Age |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 (118.1 M, 61,440) | 11.11 (17.16) | 1.54 (1.15) | 9.62 (22.14) | 6.51 (13.95) | 0.06 (0.49) | 6.27 (20.95) |
| Xception (115.4 M, 61,440) | 7.90 (15.32) | 1.24 (1.89) | 6.11 (20.74) | 3.86 (12.35) | 0.04 (0.49) | 3.60 (20.81) |
| Mdl_a: ViViT (552.0 M, 276,480) | 7.02 (12.14) | 0.99 (1.52) | 6.69 (17.16) | 3.72 (10.10) | 0.02 (0.50) | 4.38 (19.18) |
| Mdl_b: ConvLSTM (520.0 M, 276,480) | 7.12 (10.51) | 1.04 (1.19) | 5.92 (15.20) | 3.75 (8.61) | 0.02 (0.50) | 4.40 (17.94) |
| Mdl_c: ResCNN (497.2 M, 276,480) | 6.60 (12.61) | 0.90 (1.54) | 5.61 (17.68) | 3.34 (10.72) | 0.01 (0.50) | 3.65 (19.74) |
| HMdl_ab (1072.0 M, 552,960) | 6.49 (11.50) | 0.93 (1.35) | 5.18 (16.30) | 3.05 (9.50) | 0.02 (0.50) | 3.01 (18.49) |
| HMdl_ac (1049.1 M, 552,960) | 6.33 (12.64) | 0.88 (1.57) | 4.71 (17.91) | 2.80 (10.67) | 0.03 (0.50) | 2.72 (19.68) |
| HMdl_bc (1017.2 M, 552,960) | 6.56 (11.37) | 0.97 (1.38) | 4.94 (16.34) | 2.89 (9.55) | 0.02 (0.50) | 3.23 (18.56) |
| HMdl_abc (1569.2 M, 829,440) | 6.42 (11.67) | 0.88 (1.43) | 4.84 (16.56) | 2.80 (9.67) | 0.00 (0.50) | 2.56 (18.96) |
Fusion\Vital Sign | HR | SpO2 | SBP | DBP | Sex | Age |
---|---|---|---|---|---|---|
Mean Fusion | 6.18 | 0.87 | 4.65 | 2.70 | 0.01 | 2.60 |
SVM Regr. | 5.86 | 0.98 | 7.66 | 3.86 | 0.09 | 3.52 |
Gradient Boosting | 5.53 | 0.89 | 4.38 | 2.34 | 0.00 | 2.38 |
RF Regr. (100, 40) | 5.55 | 0.87 | 4.07 | 2.21 | 0.00 | 2.30 |
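Score (decision-level) fusion combines the nine models' predictions of one vital sign into a single estimate. A minimal scikit-learn sketch is below; the hyperparameters are assumptions, with "RF Regr. (100, 40)" read here as 100 trees of maximum depth 40, and each vital sign fused independently with its own regressor.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR

def fuse_scores(X_train, y_train, X_test, method="rf"):
    """Fuse per-model predictions. X_* are (samples x 9) prediction matrices for one vital sign."""
    if method == "mean":
        return X_test.mean(axis=1)               # mean fusion: simple average of the nine outputs
    if method == "svm":
        model = SVR()                            # SVM regression fusion
    elif method == "gb":
        model = GradientBoostingRegressor()      # gradient-boosting fusion
    else:
        model = RandomForestRegressor(n_estimators=100, max_depth=40, random_state=0)  # presumed (100, 40)
    model.fit(X_train, y_train)                  # learn how to weight the nine model scores
    return model.predict(X_test)
```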
Metric\NN Model | ResNet-50 | Xception | Mdl_a | Mdl_b | Mdl_c | HMdl_ab | HMdl_ac | HMdl_bc | HMdl_abc |
---|---|---|---|---|---|---|---|---|---|
# Parameters | 118.1 M | 115.4 M | 552.0 M | 520.0 M | 497.2 M | 1072.0 M | 1049.1 M | 1017.2 M | 1569.2 M |
# Features | 61,440 | 61,440 | 276,480 | 276,480 | 276,480 | 552,960 | 552,960 | 552,960 | 829,440 |
Training time (s/epoch) | 235 | 256 | 348 | 420 | 396 | 519 | 477 | 566 | 1018 |
Testing time (ms/sample) | 8.3 | 9.3 | 23.5 | 23.5 | 23.5 | 23.5 | 23.5 | 23.5 | 43.1 |