1. Introduction
Information technology and science [1] are progressing at breakneck speed, making human life more convenient than ever, and their significance in daily life is now widely recognized. The impact of computers and computer-powered technologies on healthcare and allied disciplines, as on every other domain, is well established. Alongside standard medical care, activities such as yoga, Zumba, and martial arts are commonly recognized as ways to improve one's health. Yoga is a set of practices related to a person's physical, mental, and spiritual well-being that originated in ancient India. Artificial intelligence technologies [2], such as Pose-Net and Mobile-Net SSD for human posture detection, can be effective in bringing technology to yoga practice. Despite extensive research and development in artificial intelligence and computer vision, human body posture detection remains a difficult task. It has a wide range of applications, from health monitoring to enhancing public security. Yoga has become a popular form of exercise in modern society, and there is growing demand for training in how to perform yoga poses correctly. Performing incorrect yoga postures can result in injury and fatigue, so having a trainer on hand is essential; since many people cannot afford an instructor, AI can offer this guidance. Traditionally, the instructor must demonstrate and correct each pose before the practitioner can perform it. However, many people today have adapted to using online tools for their needs and are comfortable doing so at home, which has created demand for a tech-driven yoga trainer. As a result, automatic assessment of yoga postures is proposed for yoga posture recognition, which can alert practitioners to errors. Pose-Net and Mobile-Net SSD (together referred to as TFLite MoveNet) play a major role in the proposed Y_PN-MSSD model: a Pose-Net layer takes care of feature point detection, while a Mobile-Net SSD layer performs human detection in each frame.
The working of the model is categorized into three stages:
The data collection/preparation stage: yoga postures are captured, with consent, from four users who are trained yoga professionals and are supplemented with an open-source dataset covering seven yoga poses.
The feature extraction stage: using the collected data, the model is trained, with feature extraction performed by connecting key points of the human body through the Pose-Net layer.
Posture recognition: using the extracted features, yoga postures are recognized by the Mobile-Net SSD layer, which guides the user through yoga poses by live-tracking and correcting them on the fly (a minimal sketch of stages two and three follows this list). Finally, the performance of the Y_PN-MSSD model is analyzed using a confusion matrix and compared with the existing Pose-Net CNN model.
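To make these stages concrete, the sketch below shows one way the per-frame flow could be wired together in Python, assuming a TensorFlow Lite MoveNet single-pose model for key-point extraction; the model file name, the 17-key-point layout, and the small dense classifier head are illustrative placeholders rather than the authors' released Y_PN-MSSD code.

```python
# Minimal sketch of stages two and three, under the assumptions stated above.
import numpy as np
import tensorflow as tf

# Stage 2 setup: load an assumed TFLite MoveNet single-pose model.
interpreter = tf.lite.Interpreter(model_path="movenet_singlepose_lightning.tflite")
interpreter.allocate_tensors()
in_det = interpreter.get_input_details()[0]
out_det = interpreter.get_output_details()[0]

def extract_keypoints(frame_rgb: np.ndarray) -> np.ndarray:
    """Stage 2: return 17 key points as (y, x, confidence) rows for one frame."""
    img = tf.image.resize_with_pad(frame_rgb[np.newaxis, ...], 192, 192)
    interpreter.set_tensor(in_det["index"], tf.cast(img, in_det["dtype"]).numpy())
    interpreter.invoke()
    return interpreter.get_tensor(out_det["index"]).reshape(17, 3)

# Stage 3 (illustrative): classify the flattened key points into one of the
# seven yoga poses; the paper's actual recognizer uses a Mobile-Net SSD layer.
classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(17 * 3,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),
])
```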
Many researchers have previously proposed models for technology-assisted yoga practice; these are discussed in Section 2. The methodology for recognizing yoga postures is given in Section 3. Using this methodology, the data were prepared and the proposed model was trained and tested, as explained in Section 4. Finally, as detailed in Section 5, the results are analyzed and compared with those of the existing model to justify the performance. Deep learning typically needs less ongoing human intervention and can analyze images, videos, and unstructured data in ways that conventional machine learning cannot easily match. One of its biggest advantages is the ability to perform feature engineering by itself: the algorithm scans the data to identify correlated features and combines them to promote faster learning, without being told to do so explicitly. The proposed work uses a unique fusion of Pose-Net and Mobile-Net for better pose estimation compared with contemporary works that use conventional CNN networks, and this combination also provides better accuracy. These points are discussed in detail in the Materials and Methods section and the Experimental Results section, respectively.
2. Literature Review
Chen et al. [3], in their paper titled “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Re-localization”, demonstrate a reliable monocular six-degrees-of-freedom relocalization system that works in real time and requires no graph optimization or additional engineering. Their system uses a CNN to infer the six-degrees-of-freedom pose from a single RGB camera image in an end-to-end manner. The method runs in real time both indoors and outdoors, achieving an accuracy of around 2 m and 6 degrees in large-scale outdoor scenes and up to 0.5 m and 10 degrees indoors. This is accomplished with an efficient 23-layer deep ConvNet, proving that ConvNets can be used to address challenging out-of-image-plane regression problems, a result made possible by leveraging transfer learning from large-scale classification data. They also demonstrate that the network localizes using high-level features and is robust to challenging lighting, motion blur, and varied camera intrinsics.
Islam et al. [4], in their paper “Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation”, addressed the problem that, in human pose estimation from monocular images, body joints frequently overlap or are occluded, leading to incorrect pose predictions. These conditions may even result in pose predictions that are biologically implausible. Human vision, on the other hand, can anticipate postures by exploiting the geometric constraints of joint interconnectivity. Alex et al. [5], in their paper on simple and lightweight human pose estimation, showed using benchmark datasets that most existing methods aim for higher scores by utilizing complicated architectures or computationally expensive models while neglecting deployment costs in actual use; they therefore examined the problem of straightforward, lightweight human pose estimation. Chen et al. [6], in their paper on continuous trade-off optimization between fast and accurate deep face detectors, demonstrated that deep neural networks (DNNs) are more effective at detecting faces than shallow or hand-crafted models, but their intricate designs carry higher computational overheads and slower inference rates. In this context, they researched five simple methods to find the best balance between speed and accuracy in face detection.
Zhang et al. [7], in their paper “Yoga Pose Classification Using Deep Learning”, addressed a persistent problem in machine vision that has presented numerous difficulties in the past. Human activity analysis can benefit many industries, including surveillance, forensics, assisted living, and at-home monitoring systems. Because of today's fast-paced lives, people typically enjoy exercising at home, but many still need an instructor to assess their workout form and guide them. Petru et al. [8], in their paper “Human Pose Estimation with Iterative Error Feedback”, presented deep convolutional networks (ConvNets), a type of deep hierarchical feature extractor that offers outstanding performance on a range of classification tasks using only feed-forward processing. Although feed-forward architectures can learn detailed descriptions of the input feature space, they do not explicitly model interdependencies in the output spaces, which are highly structured for tasks such as segmenting objects or estimating the pose of an articulated human. The authors therefore offer a framework that incorporates top-down feedback and broadens the expressive power of hierarchical feature extractors to cover both input and output spaces.
Yoli et al. [9], in their paper “Yoga Asana Identification: A Deep Learning Approach”, describe how yoga, a beneficial exercise with its roots in India, can revitalize physical, mental, and spiritual wellbeing and is applicable across all social domains. However, applying artificial intelligence and machine learning to transdisciplinary domains such as yoga remains difficult. Their work used deep learning methods, such as CNNs and transfer learning, to create a system that can identify a yoga pose from an image or a video frame. Andriluka et al. [10] presented an approach for grading yoga poses using contrastive skeleton feature representations. The primary goal of the yoga pose classification was to evaluate the input yoga posture, match it against a reference pose, and assign a grade. The proposed method first identified skeleton key points of the human body from yoga posture images; their coordinates were encoded into a pose feature, which was used for training along with sampled contrastive triplets, so that identically encoded pose features could be compared.
Belagiannis et al.’s [11] work discusses deep-learning-based camera pose estimation and how deep learning models perform transfer learning based on knowledge gained from generic large datasets, enabling researchers to address more specific tasks by fine-tuning the model. They first described the problem, the primary evaluation metrics, and the initial architecture (PoseNet). They then identified recent trends arising from various hypotheses about the changes needed to improve the initial solution. To support their analysis, they proposed a hierarchical classification of deep learning algorithms aimed at the pose estimation problem and explained the assumptions that drove the design of representative architectures in each identified group. They also offered a thorough cross-comparison of more than 20 algorithms, comprising findings (localization error) across different datasets and other noteworthy features (e.g., reported runtime), and evaluated the benefits and drawbacks of several systems in light of accuracy and other factors important for real-world applications.
Buehler et al. [12] proposed a spatio-temporal approach, which is essential for resolving depth and occlusion ambiguity in 3D pose estimation. Prior approaches concentrated either on local-to-global structures containing preset spatio-temporal information or on temporal contexts; effective methods to capture various concurrent, dynamic sequences and to perform real-time 3D pose estimation had not been implemented. By modelling local and global spatial information via attention mechanisms, the authors enhanced the learning of kinematic constraints in the human skeleton, including posture, local kinematic linkages, and symmetry.
Chiddarwar et al. [13] concluded that the Pose-Net CNN model was effective for yoga posture recognition and developed an Android application for it. Real-time data were captured and processed to detect yoga postures, guiding practitioners to perform yoga safely by detecting the essential key points. Compared with these models, the proposed Y_PN-MSSD model performed well on the postures captured from trained yoga practitioners.
Ajay et al. [14] proposed a system that provided consistent feedback to practitioners so that they could identify correct and incorrect poses. They used a dataset consisting of five yoga poses (Natarajasana, Trikonasana, Vrikshasana, Virbhadrasana, and Utkatasana) as input to a deep learning model that utilized a convolutional neural network for yoga posture identification. The model identified mistakes in a pose, suggested how the posture could be corrected, and classified the identified pose with an accuracy of 95%. This system prevents users from being injured and increases their knowledge of a particular yoga pose.
Qiao et al. [15] presented a real-time 2D human gesture grading system from monocular images based on OpenPose, a library for real-time multi-person keypoint detection. After capturing the 2D positions of a person's joints and the skeleton wireframe of the body, the system computed the motion trajectory equation for every joint. The similarity metric was defined as the distance between the motion trajectories of the standard and real-time videos, and a modifiable scoring formula was used to simulate gesture grading scenarios. The experimental results showed that the system worked efficiently, with high real-time performance, low equipment cost, and strong robustness to noise interference.
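To illustrate such a trajectory-distance similarity metric (the exact formula of [15] is not reproduced here, so this sketch assumes a simple mean Euclidean variant over aligned frames):

```python
import numpy as np

def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between two aligned joint trajectories.

    Each trajectory is an (n_frames, 2) array of a joint's (x, y) positions;
    a smaller distance means the real-time motion is closer to the standard.
    """
    diff = np.asarray(traj_a, dtype=float) - np.asarray(traj_b, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Toy example: two nearly identical elbow trajectories.
t = np.linspace(0.0, 1.0, 50)
standard = np.stack([t, np.sin(t)], axis=1)
realtime = standard + 0.01  # small constant offset
print(trajectory_distance(standard, realtime))  # ~0.014
```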
A blockchain-based decentralized federated transfer learning method [16] was proposed for collaborative machinery fault diagnosis, with a tailored committee consensus scheme designed to optimize the model aggregation process. Two decentralized fault diagnosis datasets were used for validation. The work proved effective for privacy-preserving collaborative fault diagnosis and suitable for real industrial applications. A deep-learning-based intelligent data-driven prediction method was incorporated in [17] to resolve sensor malfunction problems; a global feature extraction scheme and adversarial learning were then introduced to fully extract sensor information and to extract sensor-invariant features, respectively. This work was likewise shown to be suitable for real industrial applications.
A two-stream real-time yoga posture recognition system [18] was developed, achieving a best accuracy of 96.31%. A yoga posture coaching system [19] exploited six transfer learning models (TL-VGG16-DA, TL-VGG19-DA, TL-MobileNet-DA, TL-MobileNetV2-DA, TL-InceptionV3-DA, and TL-DenseNet201-DA) for pose classification, and the optimal model for the coaching system was determined based on evaluation metrics. The TL-MobileNet-DA model was selected as the optimal model, providing an overall accuracy of 98.43%.
Qian et al. [20] proposed a vision-based contactless human discomfort pose estimation method. Initially, human pose data were captured from a vision-based sensor and the corresponding human skeleton information was extracted. Five thermal-discomfort-related human poses were analyzed and corresponding algorithms constructed. To verify the effectiveness of the algorithms, 16 subjects were invited for physiological experiments. The proposed pose estimation algorithm extracted the coordinates of body key points based on OpenPose and then used the correlations between the points to represent the features of each pose. The validation results showed that the proposed algorithms could recognize the five human poses of thermal discomfort.
4. Experimental Results
The proposed model uses the layers of a deep learning model to detect incorrect yoga postures and help correct them. Vectors between nearby joints are used to estimate joint angles, and the extraction of feature points for pose estimation is characterized in this work. These features are then fed into the classification stage, which gives feedback on the correctness of the yoga pose. The work is therefore split into three parts: (1) feature extraction and computation time per frame, (2) recognition, and (3) feedback generation time per frame for categorizing yoga poses. Feature extraction and computation take the same amount of time for each method. The experiment was performed using an NVIDIA GeForce GTX-1080 GPU and a Xeon(R) E3-1240 v5 CPU. Although the training and test accuracy fluctuated considerably over the course of training, an accuracy of 0.9988 was finally attained after 200 epochs, and the training and testing loss dropped gradually between epochs 150 and 200. This results in a high-confidence training model for classifying yoga postures. The accuracy and loss values are depicted in Figure 13. It must be noted that training used a dataset containing frames from an open-source dataset together with data captured from four users performing seven yoga poses; these two sources constituted the training dataset. Testing was performed on real-time video captured by the user. Hence, the training accuracy is lower than the validation accuracy, which is a good indicator that the model performs well in a real-time scenario. Additionally, another approach was carried out to validate the performance of the model, in which the dataset was split into training and validation data in the ratio 80:20, with the two sets mutually exclusive. The accuracy plot for this approach is depicted in Figure 14. The validation accuracy closely follows the training accuracy with no major deviations between the curves, and the loss decreases significantly with increasing epochs, once again confirming the robustness of the model's classification accuracy.
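As a worked illustration of the joint-angle estimation mentioned above, the following sketch computes the angle at a joint from the vectors to its two neighboring key points; the function name and key-point choices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b, formed by the vectors b->a and b->c.

    a, b, c are (x, y) key-point coordinates, e.g. shoulder, elbow, wrist.
    """
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Example: an elbow held at roughly a right angle.
print(joint_angle((0, 1), (0, 0), (1, 0)))  # ~90.0
```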
The training and test loss gradually decreased from epoch 0 to 200, and the model does not appear to be overfitting based on the training and validation accuracy. Because the model categorizes input features into one of seven labels, a categorical (cross-entropy) loss function is employed; the confusion matrix is depicted in Figure 15. It explains the obtained accuracy pictorially: in the last class, class 7, the accuracy fluctuates slightly, which is why the overall accuracy comes out to 0.99885976.
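For reference, a confusion matrix for a seven-class problem like this one can be computed with scikit-learn; the label arrays below are dummy placeholders, not the paper's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Dummy stand-ins for the true and predicted pose labels (7 classes).
y_true = np.random.randint(0, 7, size=500)
y_pred = y_true.copy()
y_pred[:3] = (y_pred[:3] + 1) % 7  # inject a few misclassifications
print(confusion_matrix(y_true, y_pred))  # 7x7 matrix, rows = true class
```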
Table 2 lists the hyperparameters used in the proposed model. An ablation study was carried out to choose the best possible optimizer: for hyperparameter tuning, the optimizer was varied among Adam, AdaDelta, RMSProp, and Adagrad, with the softmax activation function kept fixed. The comparison of optimizers, along with loss and accuracy over 200 epochs, is shown in Table 3. From the table, it can be concluded that the proposed Y_PN-MSSD model with the Adam optimizer outperforms the other optimizers.
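A minimal sketch of how such an optimizer ablation could be run in Keras is shown below; the classifier head and the training data are placeholders, since the paper does not publish its code, and the epoch count is shortened for the demo (the paper trains for 200 epochs).

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for extracted key-point features and one-hot pose labels.
x = np.random.rand(500, 51).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 7, 500), 7)

def build_model():
    # Placeholder classifier head; the actual Y_PN-MSSD architecture differs.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(51,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(7, activation="softmax"),
    ])

for name, opt in [("Adam", tf.keras.optimizers.Adam()),
                  ("AdaDelta", tf.keras.optimizers.Adadelta()),
                  ("RMSProp", tf.keras.optimizers.RMSprop()),
                  ("Adagrad", tf.keras.optimizers.Adagrad())]:
    model = build_model()  # fresh model per optimizer for a fair comparison
    model.compile(optimizer=opt, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(x, y, validation_split=0.2, epochs=5, verbose=0)
    print(f"{name}: val_accuracy={hist.history['val_accuracy'][-1]:.4f}")
```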
A confusion matrix is used to evaluate the performance of the proposed yoga recognition model. The frame-based metrics and scores are computed using the following measures: true positive rate, false positive rate, precision, and recall. The overall performance is compared using the accuracy of the model. The definitions of these metrics are given below.
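The equations themselves did not survive in this excerpt; the following are the standard confusion-matrix definitions, stated under the assumption that the paper uses the conventional forms, with TP, TN, FP, and FN denoting true/false positives/negatives:

```latex
\begin{align}
\text{TPR (Recall)} &= \frac{TP}{TP + FN} \\
\text{FPR} &= \frac{FP}{FP + TN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align}
```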
The proposed method is tested as follows: the user presses the “Get ready to pose” button in the user interface, and video is then captured on the fly. The joints are located, and the connections between joints are shown in white until the user's pose matches the pose exhibited by the animated reference image. Once the pose is matched, the connections turn from white to green, indicating to the user that the correct pose has been attained. Video capture is then turned off by pressing the “stop pose” button in the user interface. To illustrate this, some additional sample output images showing the on-the-fly transition from incorrect to correct pose are included as Figure 16.
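A minimal sketch of how this white-to-green feedback could be rendered with OpenCV is given below; the matching threshold, the skeleton edge list, and the normalized (y, x) key-point convention are illustrative assumptions rather than the paper's published logic.

```python
import cv2
import numpy as np

# Illustrative subset of skeleton edges (pairs of key-point indices).
EDGES = [(5, 7), (7, 9), (6, 8), (8, 10), (5, 6), (11, 12)]

def pose_matches(user_kps, ref_kps, tol=0.08):
    """Assumed matching rule: mean normalized key-point distance under tol."""
    return np.mean(np.linalg.norm(user_kps - ref_kps, axis=1)) < tol

def draw_feedback(frame, user_kps, ref_kps):
    """Draw the skeleton in white while mismatched, green once pose is attained."""
    h, w = frame.shape[:2]
    color = (0, 255, 0) if pose_matches(user_kps, ref_kps) else (255, 255, 255)
    for i, j in EDGES:
        p1 = (int(user_kps[i, 1] * w), int(user_kps[i, 0] * h))
        p2 = (int(user_kps[j, 1] * w), int(user_kps[j, 0] * h))
        cv2.line(frame, p1, p2, color, 2)
    return frame
```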
The results of the proposed model are compared with those of the existing Pose-Net CNN model with respect to the derived metrics, as tabulated in Table 4; the proposed model outperforms the existing model. Its false positive rate is lower than that of the existing model, and its precision, recall, and accuracy show the best results, with both models trained and tested on the same seven-yoga-posture dataset.
Table 5 compares the accuracy of the proposed model with that of other contemporary works on yoga posture recognition; the proposed system yields the highest accuracy, 99.88%.
5. Conclusions and Future Work
Human posture estimation has been an active research topic in recent years. It differs from other computer vision problems in that key points of human body parts are tracked based on a previously defined human body structure. Yoga self-instruction systems have the potential to popularize yoga while also ensuring that it is properly performed, and deep learning models look promising given the substantial research in this field. In this paper, the proposed Y_PN-MSSD model is used to recognize seven yoga asanas with an overall accuracy of 99.88%. The model rests on Pose-Net posture assessment and Mobile-Net SSD: a Pose-Net layer takes care of feature point detection, whereas a Mobile-Net SSD layer performs human detection in each frame. The model operates in three stages. In stage one, the data collection/preparation stage, yoga postures are captured from four users as well as from an open-source dataset with seven yoga poses. In the second stage, the collected data are used to train the model, with feature extraction performed by connecting key points of the human body. Finally, the yoga posture is recognized. The model assists the user in performing yoga poses in a live-tracking mode, allowing posture correction on the fly, and it gives better results than the Pose-Net CNN model in terms of accuracy. Activity recognition is thus demonstrated in a real-world context, and a model such as this could help with pose identification in sports, surveillance, and healthcare in the future. For self-training and real-time forecasting, the model can be used by augmenting the inputs. In future work, the yoga posture recognition application can be trained with more yoga poses, and full-fledged training can be performed to build a fully adapted real-time model for noisy real-time environments, acting as a professional yoga trainer. Future work will also address other challenges faced during implementation, such as occlusion and illumination changes. Additionally, an audio-based alert can be included to signal to the user when the correct posture is attained.