Using EfficientNet-B7 (CNN), Variational Auto Encoder (VAE) and Siamese Twins’ Networks to Evaluate Human Exercises as Super Objects in a TSSCI Images
Abstract
1. Introduction
2. Literature Review
3. Materials and Methods
Skeleton Rotation
- Xmax—max() over the skeleton's X component key points;
- Xmin—min() over the skeleton's X component key points;
- Xnorm—the normalized value of the skeleton's X component key points, computed as (X − Xmin)/(Xmax − Xmin).
4. Human Movement as a Static Super Object
4.1. Review of Existing Techniques to Describe Human Movements
- Optical flow describes the motion of pixels or points in an image series by estimating their displacement between consecutive frames [19];
- Researchers use Euler angles and quaternions to represent an object’s orientation in 3D space by establishing rotation angles around the x, y, and z axes [20];
- By charting an object’s position in space at different points in time, researchers can use a trajectory to describe an object’s passage through time [21];
- In motion fields, researchers store the velocities and accelerations of each point on the object to represent an object’s motion over time [22].
4.2. Our New Approach: Movement as a Static Super Object
4.3. Generic Neural Network Implementations with TSSCI
4.4. How to Define and Label Specific Movement as a Super Object
4.5. TSSCI Use Case Examples Based on Super Objects
4.6. Comments Regarding Table 1
- This table presents a proposal for using existing architectures, but some modifications may be necessary, such as changing the dimensions of the input image. A TSSCI image typically has smaller dimensions than a typical image input. For example, in the examples we present later in this article, the dimensions of the TSSCI image are 49 × 49 × 3, so using the VGG network for classification was not appropriate: the image dimensions at the VGG input are 224 × 224 × 3, and the network's depth is too large for the size of our database, so overfitting occurred. For this reason, we used the EfficientNet-B7 image classification CNN instead. Similar adaptations will be required to use the other architectures listed in the table. We recommend selecting the most suitable network for the desired task and scaling the TSSCI dimensions accordingly (a transfer-learning sketch of such an adaptation follows this list);
- A few of the propositions are theoretically plausible inferences that still require practical proof. For example, before a DALL-E-style model could generate a choreography that controls human skeleton movements from a textual description, the network would have to be trained extensively on diverse TSSCI images tagged with varied texts;
- We recommend using this table to develop additional ideas based on existing architectures, using the super object method for human movement description.
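As a concrete illustration of the adaptation described in the first comment above, the following is a minimal Keras sketch, not the authors' exact training code, that wires an EfficientNet-B7 backbone to a 49 × 49 × 3 TSSCI input and adds a new classification head. The number of classes, dropout rate, and optimizer settings are illustrative assumptions.

```python
# Minimal sketch: EfficientNet-B7 adapted to small 49x49x3 TSSCI inputs.
# NUM_CLASSES and the training hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB7

NUM_CLASSES = 10            # assumption: number of movement classes
INPUT_SHAPE = (49, 49, 3)   # TSSCI dimensions quoted in the text

# include_top=False drops the ImageNet classifier head so the convolutional
# backbone can accept a custom, much smaller, input shape.
backbone = EfficientNetB7(include_top=False, weights="imagenet",
                          input_shape=INPUT_SHAPE)

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),   # collapse the small final feature map
    layers.Dropout(0.3),               # regularization against overfitting on a small dataset
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Freezing most of the backbone and fine-tuning only the top layers is a common further safeguard against overfitting when the TSSCI dataset is small.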
5. Convert a Sequence of Human Skeleton Movements into a Single TSSCI Image
5.1. TSSCI Needs a Buffer to Convert Temporal-Spatial Data into Spatial Data
5.2. Methods for Normalizing Skeleton Coordinates for the Implementation as TSSCI Pixels
- Mean and standard deviation: Normalizing the coordinates by subtracting their mean and dividing by their standard deviation [23];
- Min-Max normalization: We can use it to scale coordinates so that they fall within a specified range, such as (0, 1). The minimum and maximum values of the coordinates must be determined first, and then the coordinates must be scaled using the following formula: (x − min)/(max − min) [24];
- Zero-mean normalization: Center the data around zero by subtracting the mean from each coordinate. This method helps remove any bias in the data [25];
- Root mean square (RMS) normalization: To ensure that subsequent analyses are not affected by the scale of the coordinates, this method scales the coordinates so that their root mean square equals one [26];
- Scaling to unit norm: Scaling the coordinates so that their L2 norm equals one, which likewise ensures that scale does not affect the results of subsequent analyses [27]. A minimal NumPy sketch of these normalization options follows this list.
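For clarity, the options above can be condensed into the following NumPy sketch; the function names are illustrative and not taken from the paper's code.

```python
# NumPy sketch of the normalization options listed above, applied to a 1-D array
# of skeleton coordinates (e.g., all X values of one skeleton's key points).
import numpy as np

def mean_std(x):    # mean and standard deviation normalization [23]
    return (x - x.mean()) / x.std()

def min_max(x):     # Min-Max normalization to the range (0, 1) [24]
    return (x - x.min()) / (x.max() - x.min())

def zero_mean(x):   # zero-mean normalization [25]
    return x - x.mean()

def rms_norm(x):    # scale so that the root mean square equals one [26]
    return x / np.sqrt(np.mean(x ** 2))

def unit_norm(x):   # scale so that the L2 norm equals one [27]
    return x / np.linalg.norm(x)

coords = np.array([120.0, 240.0, 180.0, 60.0])  # example pixel coordinates
print(min_max(coords))                          # values scaled into [0, 1]
```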
5.3. TSSCI Dataset Augmentation
5.4. CNN-Based Automatic and Manual Video Editing
5.5. Treatment of Low Confidence Level Key Points (Missing Key Points)
6. Movement Classification with Google's EfficientNet CNN
7. Variational Auto Encoder (VAE)
8. Demonstration of How to Measure Gesture Mimicry via a Siamese Twin Neural Network
EfficientNet Transfer Learning
- σ is the sigmoid function, σ(z) = 1/(1 + e^(−z));
- x_i is the i'th element of the input vector x;
- w_i is the corresponding i'th weight.
- Loss: overall loss that the model incurs in making predictions for a binary classification problem;
- Y: label for a particular data point. It takes a value of 0 or 1 depending on whether the data point belongs to class 0 or class 1;
- D: difference between the predicted value and the actual label. It represents how well the model is performing on a particular data point;
- (1 − Y): weight on the error incurred by the model when it predicts the negative class;
- D²: square of the difference between the predicted value and the actual label. It is used to penalize larger differences;
- (Y): weight on the error incurred by the model when it predicts the positive class;
- max(0, m − D): margin term between the predicted value and the actual label. The max function ensures that this value is always non-negative;
- (1/2): term used to normalize the loss. Together, these terms form the contrastive loss Loss = (1 − Y)·(1/2)·D² + Y·(1/2)·max(0, m − D)²; a code sketch of this loss follows these definitions.
- The therapist's latent vector: our reference movement, which the CNN obtains by converting the therapist's TSSCI exercise image into latent space.
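The following is a minimal NumPy sketch of the contrastive loss assembled from the definitions above. The pair-label convention (Y = 0 for similar pairs, Y = 1 for dissimilar pairs) follows from the reconstructed formula; variable and function names are illustrative.

```python
# Contrastive loss sketch: Loss = (1 - Y) * 0.5 * D^2 + Y * 0.5 * max(0, m - D)^2
import numpy as np

def contrastive_loss(d, y, m=1.0):
    """d: Euclidean distance between the two latent vectors of a pair;
    y: pair label (0 = similar, 1 = dissimilar); m: margin."""
    similar_term = (1.0 - y) * 0.5 * d ** 2                    # pulls similar pairs together
    dissimilar_term = y * 0.5 * np.maximum(0.0, m - d) ** 2    # pushes dissimilar pairs apart, up to margin m
    return similar_term + dissimilar_term

# Example: distance between two latent vectors produced by the twin encoders.
z_a = np.array([0.2, 0.8, 0.1])
z_b = np.array([0.3, 0.7, 0.2])
d = np.linalg.norm(z_a - z_b)
print(contrastive_loss(d, y=0), contrastive_loss(d, y=1))
```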
9. Results
9.1. Creating the Extended Database Using Normalization and Augmentation
9.2. Train and Evaluate the EfficientNet-B7 Model on TSSCI Images
9.3. Results with Variational Auto Encoder
- row—Number of rows in the TSSCI image;
- col—Number of columns in the TSSCI image;
- ide—Number of identical key points in TSSCI (a minimal sketch of how these dimensions form the TSSCI buffer follows this list).
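As a rough illustration of how the row and col parameters relate to the TSSCI buffer, the sketch below packs a sequence of normalized skeleton frames into a single image: one row per frame, one column per entry of a joint traversal order, and the three color channels holding each key point's values. The traversal order and the exact channel assignment used in the paper are not reproduced here; JOINT_ORDER and the (x, y, confidence) channel choice are assumptions.

```python
# Sketch (with assumptions): packing skeleton frames into a (row, col, 3) TSSCI buffer.
import numpy as np

JOINT_ORDER = list(range(49))   # placeholder for the paper's joint traversal order (col entries)
NUM_FRAMES = 49                 # number of buffered frames (row entries)

def frames_to_tssci(frames):
    """frames: array of shape (NUM_FRAMES, num_joints, 3) with normalized values per key point.
    Returns a (row, col, 3) TSSCI-style image."""
    tssci = np.zeros((NUM_FRAMES, len(JOINT_ORDER), 3), dtype=np.float32)
    for t, skeleton in enumerate(frames):            # one row per time step
        for c, joint in enumerate(JOINT_ORDER):      # one column per traversal entry
            tssci[t, c, :] = skeleton[joint]         # key-point values -> color channels
    return tssci

frames = np.random.rand(NUM_FRAMES, 49, 3).astype(np.float32)  # synthetic example input
print(frames_to_tssci(frames).shape)                           # -> (49, 49, 3), as quoted in the text
```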
9.3.1. Synthesizing a New Movement by Combining Two Distinct Movements (TRO + SLL)
9.3.2. Generating Synthetic Movements of Specific Types Using the Super Object Method
9.3.3. Utilizing a Siamese Twin Neural Network to Score the Quality of Physical Therapy Exercises Performed by a Student Relative to Expert Physical Therapists
- D—Euclidean distance between the reference and student latent vectors;
- α—scaling factor alpha; we empirically chose α = 30;
- s—score, where 0 ≤ s ≤ 100 (a sketch of one possible mapping from D to s follows this list).
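The exact formula that maps D to s is given in the paper and is not reproduced here; the sketch below shows one plausible monotone mapping, an exponential decay controlled by the scaling factor α, purely as an illustration of how the three quantities above fit together.

```python
# Illustrative sketch only: scoring a student's movement against the therapist's reference.
# The exponential mapping from distance D to score s is an assumption, not the paper's formula.
import numpy as np

ALPHA = 30.0   # scaling factor alpha (the text reports choosing alpha = 30 empirically)

def score(reference_latent, student_latent, alpha=ALPHA):
    d = np.linalg.norm(np.asarray(reference_latent) - np.asarray(student_latent))  # Euclidean distance D
    s = 100.0 * np.exp(-d / alpha)           # assumed monotone mapping of D to a score
    return float(np.clip(s, 0.0, 100.0))     # keep 0 <= s <= 100

print(score([0.1, 0.9, 0.3], [0.2, 0.8, 0.4]))   # near-identical latents -> score close to 100
```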
9.3.4. Comparing the Quality of Synthetic Exercise Generated by VAE to That Provided by Experts Using a Siamese Twin Neural Network
10. Discussion
10.1. Dataset
10.2. Efficient TSSCI Processing with Negligible Resource Consumption
11. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Segal, Y.; Hadar, O.; Lhotska, L. Assessing Human Mobility by Constructing a Skeletal Database and Augmenting it Using a Generative Adversarial Network (GAN) Simulator. In PHealth 2022; IOS Press: Amsterdam, The Netherlands, 2022; pp. 97–103. [Google Scholar] [CrossRef]
- Segal, Y.; Yona, Y.; Danan, O.; Birman, R.; Hadar, O.; Kutilek, P.; Hejda, J.; Hourova, M.; Kral, P.; Lhotska, L.; et al. Camera Setup and OpenPose Software without GPU for Calibration and Recording in Telerehabilitation. In Proceedings of the IEEE E-Health and Bioengineering Conference (EHB), Iasi, Romania, 2021. [Google Scholar]
- Kutilek, P.; Hejda, J.; Lhotska, L.; Adolf, J.; Dolezal, J.; Hourova, M.; Kral, P.; Segal, Y.; Birman, R.; Hadar, O. Camera System for Efficient Non-Contact Measurement in Distance Medicine. In Proceedings of the 2020 19th International Conference on Mechatronics—Mechatronika (ME), Prague, Czech Republic, 2–4 December 2020; pp. 1–6. [Google Scholar]
- Blobel, B.; Oemig, F.; Ruotsalainen, P.; Lopez, D.M. Transformation of Health and Social Care Systems—An Interdisciplinary Approach Toward a Foundational Architecture. Front. Med. 2022, 9, 802487. Available online: https://www.frontiersin.org/articles/10.3389/fmed.2022.802487 (accessed on 6 May 2023). [CrossRef] [PubMed]
- Adolf, J.; Dolezal, J.; Macas, M.; Lhotska, L. Remote Physical Therapy: Requirements for a Single RGB Camera Motion Sensing. In Proceedings of the 2021 International Conference on Applied Electronics (AE), Pilsen, Czech Republic, 7–8 September 2021; pp. 1–4. [Google Scholar] [CrossRef]
- Carissimi, N.; Rota, P.; Beyan, C.; Murino, V. Filling the Gaps: Predicting Missing Joints of Human Poses Using Denoising Autoencoders. In Computer Vision—ECCV 2018 Workshops; Leal-Taixé, L., Roth, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11130, pp. 364–379. [Google Scholar] [CrossRef]
- Koch, G. Siamese Neural Networks for One-Shot Image Recognition. Master’s Thesis, Graduate Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2015. Available online: http://www.cs.toronto.edu/~gkoch/files/msc-thesis.pdf (accessed on 6 May 2023).
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar]
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.; Lee, J.; et al. MediaPipe: A Framework for Perceiving and Processing Reality. Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR). 2019. Available online: https://mixedreality.cs.cornell.edu/s/NewTitle_May1_MediaPipe_CVPR_CV4ARVR_Workshop_2019.pdf (accessed on 6 March 2023).
- Adolf, J.; Dolezal, J.; Kutilek, P.; Hejda, J.; Lhotska, L. Single Camera-Based Remote Physical Therapy: Verification on a Large Video Dataset. Appl. Sci. 2022, 12, 799. [Google Scholar] [CrossRef]
- Liao, Y.; Vakanski, A.; Xian, M. A Deep Learning Framework for Assessing Physical Rehabilitation Exercises. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 468–477. [Google Scholar] [CrossRef] [PubMed]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar]
- Xi, W.; Devineau, G.; Moutarde, F.; Yang, J. Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images. Algorithms 2020, 13, 319. [Google Scholar] [CrossRef]
- Yang, Z.; Li, Y.; Yang, J.; Luo, J. Action Recognition With Spatio–Temporal Visual Attention on Skeleton Image Sequences. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2405–2415. [Google Scholar] [CrossRef]
- Caetano, C.; Sena, J.; Brémond, F.; Santos, J.A.D.; Schwartz, W.R. SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition. arXiv 2019, arXiv:1907.13025. [Google Scholar]
- Ren, B.; Liu, M.; Ding, R.; Liu, H. A Survey on 3D Skeleton-Based Action Recognition Using Learning Method. arXiv 2020, arXiv:2002.05907. [Google Scholar]
- Ma, L.; Jia, X.; Sun, Q.; Schiele, B.; Tuytelaars, T.; Gool, L.V. Pose Guided Person Image Generation. arXiv 2018, arXiv:1705.09368. [Google Scholar]
- Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints. arXiv 2019, arXiv:1909.05704. [Google Scholar]
- Barron, J.L.; Fleet, D.J.; Beauchemin, S.S.; Burkitt, T.A. Performance of Optical Flow Techniques. Int. J. Comput. Vis. 1994, 12, 43–77. [Google Scholar] [CrossRef]
- Kuipers, J.B. Quaternions and Rotation Sequences: A Primer with Applications to Orbits, Aerospace and Virtual Reality; Princeton University Press: Princeton, NJ, USA, 2002. [Google Scholar]
- LaValle, S.M. Planning Algorithms; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
- Blake, A.; Zisserman, A. Visual Reconstruction; MIT Press: Cambridge, MA, USA, 1987. Available online: https://mitpress.mit.edu/9780262524063/visual-reconstruction/ (accessed on 20 January 2023).
- Osokin, D. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose. arXiv 2018, arXiv:1811.12004. [Google Scholar]
- Brownlee, J. How to Normalize and Standardize Time Series Data in Python. MachineLearningMastery.com. 11 December 2016. Available online: https://machinelearningmastery.com/normalize-standardize-time-series-data-python/ (accessed on 21 January 2023).
- Normalization, Codecademy. Available online: https://www.codecademy.com/article/normalization (accessed on 21 January 2023).
- How to Normalize the RMSE. Available online: https://www.marinedatascience.co/blog/2019/01/07/normalizing-the-rmse// (accessed on 21 January 2023).
- Boudreau, E. Unit-Length Scaling: The Ultimate In Continuous Feature-Scaling? Medium, 27 July 2020. Available online: https://towardsdatascience.com/unit-length-scaling-the-ultimate-in-continuous-feature-scaling-c5db0b0dab57 (accessed on 21 January 2023).
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. Available online: https://proceedings.mlr.press/v97/tan19a.html (accessed on 18 January 2023).
- van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Neural Network | Standard Task on Regular Images | Proposed TSSCI Application
---|---|---
EfficientNet-B7 | Object classifier | A system for classifying and labeling movements |
UNET | Semantic segmentation | Colors and marks the super object inside the TSSCI image. Allows extrapolation between two movements. |
YOLO | Object detection and tracking | Locating a particular movement in a film. Includes the option to extract specific movements. |
ESRGAN | Super-resolution | Enhancing motion captured at a low frame rate to a higher frame rate. |
DAE or DnCNN | Denoising auto-encoder or a denoising convolutional neural network | Restoration of the skeleton’s missing key points |
NST | Transfer the style of one image to the content of another image | Changing dance style from hip-hop to ballet while maintaining the original movements |
DC-GAN | Generating fake images | Creating fake movements |
VAE-based Image Composition | Generate new images that combine features from the two images | Combining two different movements, such as jumping and clapping, to create a new movement |
Transformer-XH | Predict the next frame in a video | Predict the next movement in a sports game
Grid-CNN | Predict a 3D model from 2D images (stereo reconstruction) | Create a 3D model of the skeleton from a 2D skeleton (stereo reconstruction) |
DALL-E | Generate images from natural language descriptions | Create an all-encompassing choreography based on natural language descriptions |