1. Introduction
Recently, the development of human pose estimation (HPE) technology has stimulated research in the field of digital fitness. Digital fitness is garnering attention because it allows for exercise guidance remotely through automated systems, even when face-to-face interactions between instructors and learners are challenging. In particular, there has been an active pursuit of research [1,2,3,4,5,6,7] in which artificial intelligence analyzes exercise videos and supports self-training for exercise without communication with a sports instructor.
Self-training is a learning method where individuals observe professional athletes’ movements and mimic those actions to improve their own sports abilities. Learners observe actual sports players’ movements, apply those movements to their own exercise routines, and analyze areas where they fall short. Through this process, they can self-correct and ultimately achieve movements that are similar to those of professional athletes. Systems that support this process automatically compare the movements of professional athletes and learners, and, by highlighting differences in movement, enable the learners to self-correct. Self-training is attracting attention, especially in high-cost sports fields, because it offers a more affordable way to learn. Particularly in the golf lesson market, where high costs were previously required, automated golf pose analysis technology has been introduced through image processing and human pose estimation research. This technology allows learners to self-study the movements of professional golfers, thus making it possible to receive golf lessons at a lower cost. The introduction of such technology has the effect of reducing the cost burden for those starting golf.
In automated golf lesson systems, the key feature of self-training is utilizing human pose estimation technology to extract the swing postures of professional golfers and learners and identifying the differences in their movements to provide feedback to the learner. In this process, the swing movements of professional golfers are transformed into embedding vectors, and algorithms are used to detect discrepancies between movements.
Conventional golf posture correction studies have extracted the joint positions of golfers through HPE technology and presented the differences in movements to learners by comparing similarities between the embedding vectors of golfer swings represented through a CNN (Convolutional Neural Network). However, there are two problems with this approach.
Firstly, as golf swing movements tend to be performed at high speeds, there is a high likelihood of obtaining blurry or ambiguous images. As a result, the accuracy of the extracted coordinates in the pose estimation process may be compromised. Therefore, it is necessary to improve the accuracy of the extracted coordinates through a pose refinement process.
Secondly, the conventional image embedding representation method does not contain enough information to compare golf swing movements. The golf swing should consider geometric elements, such as shoulder rotation angle, the angle between the shoulder and elbow, and pelvic angle, as well as the physical differences of the golfer performing the swing. In prior research [2], CNNs have been utilized to embed images of golf swings, facilitating the identification of discrepancies through the computation of similarity against reference embeddings. Nonetheless, this methodology presents problems. The derived similarity might be influenced by extraneous variables, such as background variations in images or disparities in the physical stature of the subjects under comparison. Consequently, the embedding vector’s similarity can be affected by factors beyond the mere golf swing dynamics. Furthermore, while the system can detect a frame-wise deviation between two motions, it lacks the precision to identify the specific joints responsible for the discrepancy. As a result, the system can tell the user in which frame the mismatch occurs but cannot give them direction for improvement.
In this study, we introduced a method that applies a coordinate correction network to enhance the accuracy of joint coordinates obtained through pose estimation. Our approach focuses on proposing an embedding method to accurately represent a golfer’s swing, utilizing features such as geometric analysis, physical characteristics, and swing style. We aim to offer a comprehensive golf analysis system that suggests improved postures to learners using template-based image and video generation. The detailed benefits and unique aspects of our approach, including specialized information representation and the provision of interpretable insights, will be expanded upon in the Method section. The key contributions of our research are as follows:
We introduce a coordinate correction network to improve the performance of the joint coordinates extracted through human pose estimation technology, thereby enhancing the accuracy of posture analysis.
We propose a golf swing embedding technique that allows for more accurate representation of golf swing movements, enabling specialized golf swing analysis.
Unlike previous learning methods, we can provide specific advice for user movement correction through interpretable embedding analysis.
2. Related Works
2.1. Pose Estimation Research
In [8], the field of two-dimensional (2D) human pose estimation research was categorized into Single Person Pose Estimation models and Multi-Person Pose Estimation models. Additionally, each was further divided into regression and heatmap methods, as well as bottom-up and top-down detection methods. Single Person Pose Estimation using the regression method detects joints by regressing joint coordinates directly from the feature map of the image. This method is fast, direct, and trained in an end-to-end fashion. Because of this, it can be applied to three-dimensional (3D) joint estimation without any change. However, it is difficult to train joint positions, and it is not applicable to multi-person pose estimation. BlazePose [9] used the regression method and employed tracking to utilize previous coordinate information for predicting the next coordinates. It is an extended model that can estimate 3D coordinates, obtaining x, y, and z coordinates. However, when using motion as input, the lower the image quality, the more non-detection issues occur in all coordinates. Single Person Pose Estimation using the heatmap method infers joint coordinates through heatmap prediction of the expected joint positions. It is easy to visualize and applicable to complex cases. However, it requires much memory to obtain heatmaps and is difficult to extend to 3D coordinate estimation. HRNet (High-Resolution Network) [10] uses the heatmap method and applies multiple resolutions in parallel during training to learn models that capture both global and local contexts. This model extracted human joint coordinates with higher performance than other models used in the experiments, but issues of false detection and coordinate inversion still existed. Bottom-up coordinate detection for multi-person estimation is a method that estimates the joint coordinates of people in a video and then distinguishes individuals. It detects coordinates quickly by first finding coordinates and then detecting individuals, but it has the disadvantage of lower performance. OpenPose [11], a bottom-up coordinate detection model, distinguishes individuals through features of body parts, thus improving accuracy. However, non-detection issues occurred when using golf swing motion data. Top-down coordinate detection for multi-person estimation is a method that first detects a person and then iteratively estimates the joint coordinates of the single person. It detects more accurate coordinates but is slower since it detects the person first and then iteratively estimates their pose. To improve coordinate estimation performance, both the human detection and detected human pose estimation parts need refinement. Geometric and spatial transformation processes using STN (Spatial Transformer Network) and SDTN (Spatial Detransformer Network) were suggested in [12]. This network [12] extracts high-quality human candidate frames and shows features that improve recognition performance. Moreover, the model in [13], currently a state-of-the-art model, significantly improved pose estimation performance through the Vision Transformer. However, its model size ranges from 1 million to 1 billion parameters, which requires high computational costs.
2.2. Pose Refinement Research
Pose refinement research aimed at improving coordinate accuracy in human pose estimation models can be categorized into end-to-end methods and post-processing methods [14]. This research [14] has focused on enhancing the accuracy of the coordinates estimated by human pose estimation models, which is one of the main topics of this paper. The models in [15,16,17,18,19,20,21] use an end-to-end learning approach. Although the implementation methods differ among models, they share the characteristic that pose estimation and refinement occur together within the model. In [19], iterative error feedback is employed, transferring errors step-by-step within the model and incrementally improving the pose by correcting the model’s estimation results. As in other models, the step-by-step pose refinement processes are implemented together with the pose estimation process within the model. In [20], a PRM (Pose Refine Machine) is used to improve the estimated pose within the model at the last step of the pose estimation process. In this study [20], the PRM makes more precise pose estimation possible by using high-level discriminative semantic information and low-level spatial information. In [21], human pose is estimated through a two-step process, and a GPR (Graph Pose Refinement) module in the second step is used to obtain an improved pose. The GPR module was designed as a refinement module with a graph structure that considers the relationship between joints. These kinds of end-to-end refinement modules rely on the estimation results of the pose estimation model for their output values, which does not guarantee that the refinement module will work successfully within the model.
The models that use a post-processing method in performance refinement research are in [14,22,23]. These models have a network structure that corrects the coordinates by receiving the coordinates output by the human pose estimation model. Similar implementation methods are shown in [14,22], but the method proposed in [14], which showed higher performance, obtains corrected coordinates by passing the input image and coordinates through a CNN (Convolutional Neural Network) backbone + upsampler structure. In [23], research was conducted to recognize actions using extracted joints, correcting coordinates through a PRM (pose refinement module), and then utilizing them. The PRM used in [23] was designed as a structure that obtains corrected poses by passing through a GCL (Graph Convolutional Layer) and TCL (Temporal Convolutional Layer). Unlike the end-to-end methods that operate within the model, these post-processing methods are applied after the model estimates the coordinates, which reduces model dependency. However, they also present the problem of increased computational requirements due to the additional coordinate correction process.
2.3. Self-Training Research
Digital fitness can be classified in three ways depending on the participation of the user and the instructor (Figure 1). First, there is the passive approach [24], which occurs through an online assessment by a human instructor. This method involves evaluating the student’s movements through real-time video calls or video submissions and providing feedback, following the model of traditional offline fitness coaching. The advantage of this approach is that it allows for expert guidance. However, it has the drawback of being dependent on human resources, as it requires the assistance of a human instructor.
Second, the hybrid approach [25] allows users to receive help from both a human instructor and an automated system. This approach offers the advantage of unrestricted learning through an automated system and the ability to receive expert assessments. This method allows for more precise and varied perspectives on information, but its cost savings are limited because it still requires the participation of a human expert.
Lastly, the self-training approach [1,2,3,4,5,6,7] is based on learners following expert movements or standardized movements on their own without the intervention of a human instructor. In this approach, learners can learn on their own through expert workout videos, or they can use automated systems that compare the postures of experts and learners, analyze the differences, and provide guidance. The advantage of this approach is that it does not require human resources, making it cost-effective and allowing for training without time constraints. However, this method heavily relies on the learner’s will and the system’s performance, so the effectiveness of the exercise can vary depending on the individual.
3. Method
In this section, we detail our methodology designed to compare a user’s golf swing with that of a professional golfer. The overall process is illustrated in Figure 2. Our approach consists of two stages. The first stage revolves around pose estimation and its post-processing refinement for accurate pose guidance, illustrated in the HPE and PRN blocks of Figure 2a. To achieve this, we produced pose error data by simulating incorrect poses based on real-world data. These data are then used to train a sequential model, which detects errors in the pose joints extracted by our primary pose estimation model. In the next stage, which is illustrated in the Norm and Vector Representation blocks of Figure 2a, we translate the swing motion into a vector form that consists of explainable feature values, such as gender, ratio, angle, etc., allowing for a direct comparison between two distinct golf swing motions. We compare the user’s swing embedding vector with the pros’ embeddings to identify the most similar pro golfer and to detect the discrepant joints for self-learning. By visualizing the differences between these motions, we aim to facilitate the user’s self-guided learning.
3.1. Pose Refinement Process
In the pose estimation process for a golf swing video, joint coordinates are difficult to detect because of the fast motion. For instance, as shown in Figure 3, when the swing movement is fast, coordinate estimation errors occur more frequently. These coordinate errors can generate inaccurate guidance during posture analysis and error calculation. To address this problem, we propose a method that removes outliers at low computational cost by combining rule-based outlier detection with a single-layer Bi-LSTM (Bidirectional Long Short-Term Memory) network. The Bi-LSTM model is trained on data mimicking the outlier coordinates that occur during the pose estimation process of a pose estimation model and outputs, for each joint, whether an outlier occurs in each frame. We also suggest a rule-based outlier detection algorithm that exploits features of the golf swing posture and can be applied to the body parts that frequently cause pose errors in golf swing motion. We then remove the joint coordinates of the frames with outliers and fill in the missing coordinates by interpolation. Figure 4 shows the overall structure of the pose refinement process.
3.1.1. Data Generation Technique for Learning Pose Errors
The proposed algorithm includes steps to remove outliers from estimated coordinates and interpolate missing coordinates. For outlier data collection, we used a data generation technique that mimics outliers based on data containing the correct joint coordinate labels. As illustrated in Figure 3, outliers in existing pose estimation models show a tendency for joint coordinates to be estimated to be a certain arbitrary distance, x, y, away from the actual correct coordinates. To mimic this, we applied amplification or attenuation to arbitrary x- and y-values from the actual correct joint coordinates to generate data mimicking actual outliers. The amplification and attenuation of the coordinates are applied to randomly selected frames so that the model can capture changes in consecutive frames.
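The data generation step can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the function name `inject_outliers`, the perturbation rate, and the offset range are hypothetical choices made only for illustration, since the text does not specify how the offsets are sampled.

```python
import random

def inject_outliers(frames, rate=0.1, min_shift=20.0, max_shift=80.0, seed=0):
    """Return (noisy_frames, labels) for one joint's (x, y) track.

    frames: list of ground-truth (x, y) coordinates, one per frame.
    labels[i] is 1 where frame i was perturbed and 0 otherwise.
    rate, min_shift and max_shift are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    noisy, labels = [], []
    for x, y in frames:
        if rng.random() < rate:
            # Amplify or attenuate each axis by a random offset.
            dx = rng.choice((-1, 1)) * rng.uniform(min_shift, max_shift)
            dy = rng.choice((-1, 1)) * rng.uniform(min_shift, max_shift)
            noisy.append((x + dx, y + dy))
            labels.append(1)
        else:
            noisy.append((x, y))
            labels.append(0)
    return noisy, labels
```

The returned labels mark the perturbed frames, so they can serve as frame-wise targets for an outlier detector.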
3.1.2. Outlier Detection Model and Correction Algorithm
The outlier detection model uses a single-layer Bi-LSTM (Bidirectional Long Short-Term Memory) network; when an outlier is detected, it is indicated in the output value of the corresponding frame. This model can be used for all joints of the body to detect outliers. Figure 5 shows the changes in coordinates by frame and visualizes the spike point on the graph. This graph confirms that when outliers occur, the coordinate changes are abnormally large compared to previous frames. As seen in Figure 3, poses can be estimated at a certain level of accuracy in actions with slower speeds, but the proportion of inaccurate coordinate estimations increases as the action speed increases. The data input into the Bi-LSTM network is the coordinate change amount for each frame, calculated as the Euclidean distance between a joint’s coordinates in consecutive frames. Each piece of data contains the change amounts for one joint. As the range of the change amount differs depending on the video size, each piece of data is divided by its maximum value to normalize it between 0 and 1. The network’s maximum input length is also fixed, and shorter inputs are padded with 0. Network (b) in Figure 4 is this frame-by-frame outlier detection model, which outputs the occurrence of outliers in the coordinates estimated by the pose estimation model. It is trained on the outlier-mimicking data generated in the previous step, in which amplification and attenuation were applied to randomly selected frames so that the model could capture changes in consecutive frames.
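The input preparation described above (per-frame Euclidean displacement, division by the maximum, zero padding) can be sketched as follows. The function name `displacement_features` and the convention that the first frame receives a displacement of 0 are assumptions made for this illustration.

```python
import math

def displacement_features(track, max_len):
    """Per-frame displacement of one joint, scaled to [0, 1] and zero-padded.

    track: list of (x, y) coordinates of a single joint, one per frame.
    max_len: fixed input length expected by the sequence model.
    """
    # Distance between consecutive frames; the first frame has no predecessor,
    # so it is given a displacement of 0 (an assumption of this sketch).
    deltas = [0.0] + [math.dist(a, b) for a, b in zip(track, track[1:])]
    peak = max(deltas)
    if peak > 0:
        deltas = [d / peak for d in deltas]
    deltas = deltas[:max_len]
    return deltas + [0.0] * (max_len - len(deltas))
```

For example, a track that moves 5 pixels, stays still, and then jumps 10 pixels yields the sequence 0, 0.5, 0, 1 followed by padding zeros, so the jump stands out as the spike.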
3.1.3. Rule-Based Outlier Detection
The algorithms in Figure 4 are designed for rule-based outlier detection, specifically tailored to the unique features of a golf swing. The coordinate system used in this paper has its origin (0, 0) at the top left, with the y-value increasing downward and the x-value increasing to the right. At impact, the ball flies towards the upper right. Algorithm (a), termed the ‘Inversion Detector’, identifies instances where the left and right pose coordinates are swapped. Throughout the entire swing motion, the x-coordinate of the right ankle should never exceed that of the left ankle. Based on this observation, the Inversion Detector flags any instance where the right ankle’s x-coordinate surpasses the left ankle’s x-coordinate. Similarly, algorithm (c) identifies outliers in the lower body by examining the distance between the two ankles. If this distance becomes less than half of the expected separation, the swing is flagged as having an anomaly. Lastly, algorithm (d) focuses on wrist coordinates. A golf swing video shows a motion of holding a golf club with both hands from the beginning to the end of the swing, raising it clockwise, and then swinging it counterclockwise towards the ball. During the golf swing, both hands are always holding the golf club. Since the distance between the two wrists therefore remains constant throughout the swing, any deviation from this distance range is considered an error, and the swing is flagged accordingly by this detector. Frames flagged by algorithm (a) are corrected by directly swapping the left and right joint coordinates, whereas the coordinates flagged by algorithms (c) and (d) are deleted.
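The three rules can be sketched in Python as below. This is an illustrative sketch only: the joint keys (`l_ankle`, `r_wrist`, and so on), the way the expected separations are supplied, and the `tolerance` parameter are hypothetical, because the text does not state how the expected distances or the allowed range are obtained.

```python
import math

def inversion_flag(pose):
    """Rule (a): the right ankle's x should not exceed the left ankle's x."""
    return pose["r_ankle"][0] > pose["l_ankle"][0]

def ankle_flag(pose, expected_gap):
    """Rule (c): ankles closer than half of the expected separation."""
    return math.dist(pose["l_ankle"], pose["r_ankle"]) < 0.5 * expected_gap

def wrist_flag(pose, expected_gap, tolerance):
    """Rule (d): wrist distance outside the expected range (tolerance is assumed)."""
    gap = math.dist(pose["l_wrist"], pose["r_wrist"])
    return abs(gap - expected_gap) > tolerance

def swap_sides(pose):
    """Repair an inversion by swapping every left/right joint pair."""
    fixed = dict(pose)
    for key in pose:
        if key.startswith("l_"):
            other = "r_" + key[2:]
            if other in pose:
                fixed[key], fixed[other] = pose[other], pose[key]
    return fixed
```

In such a design, frames flagged by `inversion_flag` would be passed through `swap_sides`, while joints flagged by the other two rules would be set to missing and later filled by interpolation.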
3.1.4. Interpolating Missing Coordinates
The algorithm for interpolating missing coordinates after outliers are removed through the outlier detection model is represented in Figure 4e. Candidate interpolation methods include Linear, Next, Previous, and Nearest; a performance comparison of these methods showed that the Linear method performs best. The Next and Previous methods fill the missing space by duplicating the values from the subsequent and preceding data points, respectively. The Nearest method fills the missing space by duplicating the value from the data point that has the closest x-value. However, these approaches simply duplicate an existing value, so they do not reflect the changes between two data points. Linear interpolation is a method used to estimate the value of f(x) for an x-value between two given points. This method allows for more accurate interpolation of values because it takes into account the movement path and changes over time, reflecting the variations in each frame.
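As a brief illustration, linear interpolation between two known frames x0 and x1 estimates f(x) = f(x0) + (x - x0) / (x1 - x0) * (f(x1) - f(x0)). The sketch below applies this to one coordinate axis, with removed frames marked as `None`. The function name and the handling of gaps at the very start or end of the sequence (copying the nearest known value) are assumptions of this sketch.

```python
def interpolate_linear(values):
    """Fill None entries by linear interpolation between known neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    filled = list(values)
    # Gaps before the first or after the last known frame have only one
    # neighbour, so they copy that value (assumed behaviour).
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps are filled along the straight line between neighbours.
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled
```

Applied separately to the x- and y-values of a joint, this reproduces the frame-by-frame movement between the surrounding valid detections instead of repeating a single value.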
3.2. Golf Swing Analysis Algorithm
This section discusses the process of comparing a professional’s swing with a learner’s swing to identify differences and create a guide based on these insights. To compare two different swing actions, both temporal and spatial alignment are required.
Firstly, spatial alignment involves applying pose normalization to standardize the coordinate sizes of both the expert’s swing videos and the learner’s swing videos, which have been shot at various resolutions. Secondly, temporal alignment entails dividing the swing actions according to the categorization of golf actions, assigning corresponding actions to each frame, and synchronizing the swings of the expert and learner, which may proceed at different speeds.
Next, to find the most similar expert, an embedding vector is constructed using information such as the angles of each joint, shoulder information, gender, and body proportions. By comparing the expert’s embedding and the learner’s embedding, the most similar expert is selected. Subsequently, feature importance analysis is conducted to identify the features needing correction, and these features are converted into human-understandable concepts, such as left elbow angle, shoulder angle, etc. Finally, the information requiring correction is used to create guide text and videos through a template-based generation module. Detailed descriptions of each step are provided below.
3.2.1. Spatial and Temporal Alignment
In this section, we introduce an algorithm for spatially and temporally aligning coordinates to compare golf swing poses that were gathered in various environments. Spatial alignment refers to adjusting coordinates estimated at different resolutions to the same size and location. To do this, we define the standard coordinate $(x_s, y_s)$ and height $h_s$ for normalizing the size and location of the pose. We first calculate the ratio $r$ by dividing the standard height $h_s$ by the target height $h_t$, which is calculated as the distance between the head and foot coordinates of the target pose. Then, we calculate the distance between the standard coordinates $(x_s, y_s)$ and the resized root joint $(r\,x_{root}, r\,y_{root})$ of the target pose to derive the required coordinate shift distances $d_x$ and $d_y$ for each axis. Finally, we can acquire the normalized $X$ of each target joint $i$ by:
$$X_i = \left(r\,x_i + d_x,\; r\,y_i + d_y\right), \qquad r = \frac{h_s}{h_t}, \qquad d_x = x_s - r\,x_{root}, \qquad d_y = y_s - r\,y_{root}.$$
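The spatial normalization can be sketched in Python as follows. The function and parameter names (`normalize_pose`, `std_root`, `std_height`) and the default standard values are hypothetical; only the scale-then-shift procedure follows the description above.

```python
import math

def normalize_pose(joints, head, foot, root, std_root=(0.0, 0.0), std_height=1.0):
    """Scale and shift a pose so its height and root joint match the standard.

    joints: dict mapping joint name -> (x, y).
    head, foot, root: keys of joints used for the height and the anchor.
    std_root, std_height: assumed standard coordinate and height.
    """
    target_height = math.dist(joints[head], joints[foot])
    r = std_height / target_height
    root_x, root_y = joints[root]
    # Shift that moves the resized root joint onto the standard coordinate.
    dx = std_root[0] - r * root_x
    dy = std_root[1] - r * root_y
    return {name: (r * x + dx, r * y + dy) for name, (x, y) in joints.items()}
```

After this step, two poses recorded at different resolutions share the same height and root position, so their joint coordinates can be compared directly.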
Temporal alignment involves the use of a posture segmentation algorithm to divide the swings of both learners and experts into stages based on the eight distinct actions in golf. The primary goal of this algorithm is to adjust temporal differences between two videos by recognizing each segment of the swing action through action segmentation. In this study, golf positions were categorized into Address, Takeback, Backswing, Backswing-top, Downswing, Impact, Follow, and Finish. We annotated the joint coordinates manually using standard swing posture images representing each posture and calculated the similarity of the user’s entire swing and each action using Euclidean distances. However, due to the nature of golf swing actions, the Address, Takeback, and Backswing positions display similar motions to the Impact and Downswing positions. This similarity causes irregular posture segmentation and prevents the adjustment of the temporal differences between positions by reversing the temporal flow of the swing. In this study, we proposed an algorithm that divides golf swing positions into two segments based on the Backswing-top and Finish positions to distinguish each action. The proposed algorithm is detailed in Algorithm 1, and the flow chart is illustrated in Figure A1 in Appendix A.
Algorithm 1 analyzes changes in the y-coordinates of both wrists to detect the Backswing-top and Finish actions in a golf swing, at which the wrists reach their highest points. When the y-value increases from the preparatory action, the action is considered to have started, and the search for the highest height begins. The algorithm tracks changes, updates the highest y-value, and determines that the Backswing-top action has been found when the value no longer increases and begins to decrease. Afterward, the y-value is updated again until the Finish action appears. Through this process, the golf action is divided into two sections based on the Backswing-top, and the Euclidean distance measurement is used to determine the reference frames for the Address, Takeback, and Backswing actions in the first section and the remaining actions in the second section. In this way, the posture data containing the action segmentation markers for the temporal alignment of golf swing video frames can be obtained. Finally, we are able to obtain the normalized data through the spatial and temporal alignment algorithms.
Algorithm 1: Pose Division Algorithm
Input: Pose keypoints k
Output: The list of the two dividing points; the height h of the two top points
function PoseDivision(k)
    Initialize: …
    …
    for each frame, … in enumerate(k) do
        …
        if length(…) …, …, and … == False then
            … True
        end if
        if … and … then
            … False
            … True
        end if
        if … == True and … then
            …
            …
        end if
        if … is True and … then
            …
            …
        end if
        …(…)
    end for
    return …
end function
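The behaviour described for the pose division step can be approximated by the following Python sketch. It is a reconstruction from the prose description, not a transcription of Algorithm 1: the function name `divide_pose`, the jitter margin `rise`, and the use of a wrist-height signal in which larger values mean higher wrists (for example, image height minus the wrist y-coordinate) are all assumptions.

```python
def divide_pose(heights, rise=5.0):
    """Return (top_idx, finish_idx) from a per-frame wrist-height signal.

    heights: one value per frame; larger means the wrists are higher.
    rise: assumed margin used to ignore small jitter.
    Returns (None, None) if no swing is detected.
    """
    base = heights[0]
    started = False
    best, best_idx, top_idx = base, 0, None
    for i, h in enumerate(heights):
        if not started:
            # The swing starts once the wrists rise above the address level.
            if h > base + rise:
                started, best, best_idx = True, h, i
            continue
        if h > best:
            best, best_idx = h, i
        elif h < best - rise:
            # Height stopped increasing and began to fall: Backswing-top found.
            top_idx = best_idx
            break
    if top_idx is None:
        return None, None
    # Second section: lowest point after the top, then the highest point after it.
    rest = heights[top_idx + 1:]
    low_idx = top_idx + 1 + rest.index(min(rest))
    tail = heights[low_idx:]
    finish_idx = low_idx + tail.index(max(tail))
    return top_idx, finish_idx
```

The two returned indices split the swing into a first section (Address to Backswing-top) and a second section (Downswing to Finish), within which the remaining positions could then be matched to reference poses by Euclidean distance.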
3.2.2. Swing Similarity Calculation
We detail a method for comparing the actions of professional golfers and learners using golf swing embedding to find the most similar pro golfer. To accurately represent the golf swing, we build the swing embedding from the swing style and physical features of the golfer. The swing style is inspired by the classification of golf styles into hitters and swingers and is characterized by elements that can distinguish each style. We utilize four features of the golfer and swing motion: gender, body proportion, the ratio of the y-values of the two top positions (Backswing-top, Finish), and the body angle at the Impact position. To separate the two genders by a large difference in similarity, one gender is assigned the value 0 and the other a distinct second value in the embedding. The body proportions are calculated by dividing the lengths of all 15 edges between joints by the height of the golfer. To capture the relative speed of the swing, we calculate the ratio of five positions to the total number of frames. Because the Address and Finish positions are fixed motions, Takeback, Backswing, Backswing-top, Downswing, Impact, and Follow are used in the representation of the swing speed. Furthermore, we represent the swing style by calculating the ratio of the y-values of the two top positions and the right knee and right elbow angles at the Impact position. The generated 24-dimensional embedding is used to compare the swing actions of professional golfers and a learner by their cosine similarity scores and to select the most similar pro golfer. This allows learners to refer to the swing of a pro golfer similar to their own swing and proceed with the necessary joint correction.
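The selection of the most similar pro golfer by cosine similarity can be sketched as below. The function names and the dictionary layout of the pro database are hypothetical, and short toy vectors stand in for the 24-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar_pro(learner, pros):
    """Return (name, score) of the pro whose embedding is closest to the learner's.

    pros: dict mapping a pro golfer's name -> embedding vector.
    """
    scored = ((name, cosine_similarity(learner, emb)) for name, emb in pros.items())
    return max(scored, key=lambda item: item[1])
```

For instance, `most_similar_pro([1, 0, 1], {"A": [1, 0, 1.1], "B": [0, 1, 0]})` selects "A", since "B" is orthogonal to the learner’s vector.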
3.2.3. Golf Swing Pose Correction
After finding a similar golfer, the system identifies joints that need correction to improve the swing action of learners using explainable golf swing embedding. The posture correction algorithm measures core elements of the golf swing, such as shoulder and hip rotation angles, for each action and creates a joint angle embedding. The angle embedding is represented as a 10-dimensional vector, including the angles of both shoulders, hips, elbows, and knees, and the rotation angles of shoulders and hips. Based on this embedding that represents the geometric information of the pose, a cosine similarity score with the most similar pro golfer is calculated for each swing position. To identify the specific points of body parts that show the discrepancy, the impact of each feature of embedding is observed by sequentially removing each embedding element. If the similarity increases when an element is removed, that element has a negative impact on similarity and needs improvement. Conversely, if the similarity decreases, that element has a positive impact on the similarity and needs to be maintained. We can easily convert each feature to a location of the body because the elements of the feature have the semantic information of body parts. By detecting and presenting specific angles that need improvement in the user’s actions in this way, the algorithm supports self-training by users.
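The leave-one-out analysis described above can be sketched as follows. The function names and feature labels are hypothetical, and a three-element toy vector is used in place of the 10-dimensional angle embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def features_to_correct(learner, pro, names):
    """Return feature names whose removal increases similarity, largest gain first."""
    learner, pro = list(learner), list(pro)
    base = cosine(learner, pro)
    gains = {}
    for i, name in enumerate(names):
        # Drop feature i from both embeddings and re-score.
        reduced_l = learner[:i] + learner[i + 1:]
        reduced_p = pro[:i] + pro[i + 1:]
        gains[name] = cosine(reduced_l, reduced_p) - base
    # A positive gain means the feature was lowering similarity.
    flagged = [n for n in names if gains[n] > 0]
    return sorted(flagged, key=lambda n: -gains[n])
```

With `pro = [90, 90, 90]` and `learner = [90, 90, 40]` for the features left elbow, right elbow, and left knee, only the left knee is returned, because removing it raises the similarity to 1 while removing either elbow lowers it.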
3.2.4. Generation of Swing Guide for Self-Training
We created template-based guide texts and videos to help users clearly analyze their swings. The text generation process begins with a brief greeting and presents a summary of the pro player most similar to the user, the similarity scores, the most similar posture, and the actions that need improvement. In addition, for actions that need correction, the guide video sequentially shows the footage of a pro golfer and the user for each segmented action and visualizes the skeleton of the joints that need correction, making it easy to understand at a glance. Figure 6 presents an example of the generated video. To simplify the differentiation of poses, the pose information is displayed at the top-left corner of the video. Additionally, the video highlights specific joints that require adjustments to align with the swing motion of a professional golfer.
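Template-based text generation of this kind can be illustrated with a short Python sketch. The template wording, the function name `build_guide_text`, and the structure of the `corrections` argument are hypothetical and serve only to show how analysis results could be slotted into a fixed text.

```python
# Hypothetical template; the actual wording used by the system is not given.
TEMPLATE = (
    "Hello! Your swing is most similar to {pro} (similarity {score:.2f}). "
    "Your closest position is {best}. "
    "Positions to improve: {improve}."
)

def build_guide_text(pro, score, best, corrections):
    """Fill the template.

    corrections: dict mapping a swing position -> list of joints to correct.
    """
    if corrections:
        improve = "; ".join(
            f"{position} ({', '.join(joints)})"
            for position, joints in corrections.items()
        )
    else:
        improve = "none"
    return TEMPLATE.format(pro=pro, score=score, best=best, improve=improve)
```

Because every slot corresponds to an interpretable quantity (pro name, similarity score, position, joint), the generated text stays consistent with the joints highlighted in the guide video.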
4. Prototype Implementation
We implemented the prototype of the proposed system utilizing the Python Qt5 library version 5.15.9. The implemented system includes simple login functionality, video upload, and guide video creation processes. Through this system, users can upload their golf swing videos and visually compare which parts differ from a professional’s motion. This allows users to self-learn to achieve a swing more similar to that of a professional. Figure 7 illustrates the prototype where the proposed method is applied.
Users can create an account using basic information (account details, gender, etc.). Upon logging in with the created account, the main page for golf swing diagnosis is displayed. On the main page, users can diagnose their golf swing, view past records, and set options. Options allow users to decide whether they want to strictly view the swing differences between the pro and the learner, view them in a balanced manner, or view them conventionally. Additionally, settings related to the pose estimation model and coordinate interpolation can be adjusted.
After setting the options and pressing the swing diagnosis button, users are directed to the video upload page. Upon setting the file path for the user’s swing video, swing analysis is initiated. In this process, the proposed method identifies the swing of a professional golfer that is most similar to the user’s swing and demonstrates, through video and text generation, which parts need correction to achieve a more similar swing. The text generation offers a template-based presentation of an overall evaluation of the swing, which actions were most similar to those of the pro golfer, and which actions need improvement. The generated video aids the user by visually marking which joints differ, helping users easily understand the required corrections. The created guide videos are accessible through the history page, allowing users to easily access and review past records in the future.
5. Experiments
In this section, we present detailed evaluation results to assess the efficacy of our proposed approach. Firstly, we analyze the quantitative improvement in performance during the pose refinement process, accompanied by various visual aids supporting our findings. Moreover, we demonstrate that our suggested enhancement method is effective not only for 2D coordinate refinement but also for 3D coordinate improvement. Through a qualitative evaluation, we describe the intuitive characteristics of the enhanced pose. Finally, we provide examples of text and videos generated using the proposed self-training posture comparison algorithm, illustrating the effectiveness of our approach. The joint coordinates and their indices used in this paper are depicted in Figure 8.
5.1. Dataset
To assess performance, we gathered golf swing motion data, manually annotating joint labels on golf swing videos. The collected data encompasses a total of ten distinct swing motions, of which five were used for training and the remaining five for evaluation. For measuring the enhancement performance of 3D coordinates, we utilized the publicly available 3DPW [26] dataset. Out of the data in the 3DPW dataset featuring individual subjects, ten instances were used for training, and four instances were deployed for testing. To build a database of professional golfers, we collected swing data from 16 players listed on the World Ranking provided by the PGA TOUR [27]. These data came without joint annotations. Therefore, joint labels were automatically generated using a pose estimation model.
5.2. Metric
To verify performance improvements, we employed the mAP (mean Average Precision) score as a metric. The mAP score is a commonly used evaluation metric in joint detection tasks, assessing the accuracy of the estimated joint coordinates. The AP score utilizes OKS (Object Keypoint Similarity), a normalized distance measurement criterion between the predicted and actual key points. OKS is evaluated against a threshold; threshold values closer to 1 entail a stricter evaluation, whereas values closer to 0 offer a more lenient assessment. We adopt the standardized approach of averaging the results over threshold values from 0.5 to 0.95 in steps of 0.05. Additionally, for evaluating 3D coordinates, we also measure the MPJPE (Mean Per Joint Position Error) score. When mAP is measured on 3D coordinates, the accuracy tends to drop significantly as the number of prediction axes increases. To counter this, our performance measurement calculates the accuracy using the error between the estimated pose’s root joint distance to the target joint and the ground truth pose’s root joint distance to the target joint.
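The OKS computation and the threshold averaging described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in the paper: it assumes the COCO-style OKS definition with an object area and per-joint constants (`area` and `kappas` are hypothetical inputs), assumes thresholds running from 0.50 to 0.95 in steps of 0.05, and simplifies AP to the fraction of poses whose OKS meets each threshold, which ignores detection ranking.

```python
import math

def oks(pred, gt, area, kappas, visible=None):
    """COCO-style Object Keypoint Similarity for one 2D pose.

    pred, gt : lists of (x, y) joint coordinates
    area     : object scale squared (assumed input, e.g. bounding-box area)
    kappas   : per-joint falloff constants (assumed input)
    visible  : optional list of booleans; hidden joints are skipped
    """
    if visible is None:
        visible = [True] * len(gt)
    total, count = 0.0, 0
    for (px, py), (gx, gy), k, v in zip(pred, gt, kappas, visible):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0

# Assumed threshold grid: 0.50, 0.55, ..., 0.95
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def average_over_thresholds(oks_scores, thresholds=THRESHOLDS):
    """Simplified mAP: share of poses passing each threshold, then averaged."""
    per_threshold = [
        sum(score >= t for score in oks_scores) / len(oks_scores)
        for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)
```

A pose that matches the ground truth exactly scores an OKS of 1 and passes every threshold, while a pose with moderate error passes only the lenient thresholds, which is why the averaged score rewards precise localization.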
5.3. The Result of 2D Golf Pose Refinement
In this section, we evaluated our pose refinement method using 2D golf swing data. For our baseline model, we utilized BlazePose [9]. Through a total of five golf swing motion datasets, we assessed the effectiveness of our proposed method.
Table 1 displays the evaluation results. The higher performance values are bolded. Our experimental results indicate that adding our proposed pose refinement module to the baseline model enhances performance. The notations (a), (c), and (d) in the table represent the rule-based outlier detection methods that utilize the characteristics of golf swings. When we removed the rule-based methods, performance improved in some instances but declined overall. This shows that, while our proposed method effectively improves the actual performance, relying on a rule-based approach can result in performance degradation when the scenario deviates from the proposed conditions.
5.4. The Result of 3D Pose Refinement
We evaluated the applicability of our proposed model in 3D coordinates using images from the 3DPW dataset where only a single person appears.
Table 2 presents our experimental results. The data used for evaluation include “courtyard_bodyScannerMotions_00”, “courtyard_jumpBench_01”, “courtyard_relaxOnBench_00”, and “outdoors_freestyle_01”, represented as Pose 3, 6, 9, and 12, respectively. Due to the length constraints of the trained model, each dataset was truncated to 180 frames for evaluation. In the 3D coordinate improvement evaluation, we did not apply the rule-based outlier removal technique, which is only applicable to golf motions. In the MPJPE scores, we observed performance improvements in most evaluation results when using our post-process coordinate refinement module. Notably, for Pose 9, we observed a roughly 9% reduction in the MPJPE score. Previous studies faced limitations in the refinement of 3D coordinates because they relied on image-processing approaches. In contrast, our proposed method demonstrates that improvements in three-dimensional coordinates are achievable by leveraging coordinate sequence information.
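For reference, MPJPE and the relative reduction quoted above can be computed as in the following sketch. It assumes the standard definition of MPJPE as the Euclidean distance between predicted and ground-truth joints averaged over all joints and frames; the function names are illustrative, and any alignment or root-centering step used in the actual evaluation is not modeled here.

```python
import math

def mpjpe(pred_seq, gt_seq):
    """Mean Per Joint Position Error over a sequence of poses.

    pred_seq, gt_seq : lists of frames; each frame is a list of (x, y, z) joints.
    """
    total, count = 0.0, 0
    for pred_pose, gt_pose in zip(pred_seq, gt_seq):
        for p, g in zip(pred_pose, gt_pose):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    return total / count

def relative_reduction(before, after):
    """Fractional drop in error after refinement (0.09 corresponds to 9%)."""
    return (before - after) / before
```

Because a lower MPJPE is better, a positive relative reduction indicates that refinement moved the joints closer to the ground truth.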
5.5. The Impact of Interpolation Method
The interpolation methods compared in this experiment are Linear, Cubic, Nearest, Previous, and Next. We conducted the experiment based on the model with the best performance in Table 2.
Table 3 compares the average interpolation performance of each method in terms of the MPJPE score. The Cubic method uses a polynomial to interpolate values. It is sensitive to outliers, meaning that the inclusion of even a single outlier can significantly decrease the overall performance. In our experiments, its interpolation performance was lower than that of the other methods. The other methods performed similarly. However, Previous and Next simply copy neighboring values, which does not serve the purpose of this study, and Nearest performed well but not as well as Linear. The Linear method is stable in the sense that outliers do not bring down the overall performance, and it achieved the highest interpolation performance.
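The difference between the gap-filling strategies can be illustrated with a small sketch. It assumes that removed outlier frames are marked as `None` in a single coordinate track, and it covers only Linear, Nearest, Previous, and Next; Cubic is omitted because it needs a spline fit. The function `fill_gaps` is a hypothetical helper, not the implementation used in the experiments.

```python
def fill_gaps(values, method="linear"):
    """Fill None entries in a 1D coordinate track from neighboring valid frames."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid frames to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            # Gap at the start of the track: only a later frame is available.
            filled[i] = values[nxt]
        elif nxt is None:
            # Gap at the end of the track: only an earlier frame is available.
            filled[i] = values[prev]
        elif method == "linear":
            t = (i - prev) / (nxt - prev)
            filled[i] = values[prev] + t * (values[nxt] - values[prev])
        elif method == "previous":
            filled[i] = values[prev]
        elif method == "next":
            filled[i] = values[nxt]
        elif method == "nearest":
            # Ties go to the earlier frame in this sketch.
            filled[i] = values[prev] if i - prev <= nxt - i else values[nxt]
        else:
            raise ValueError(f"unknown method: {method}")
    return filled

track = [0.0, None, None, 3.0]
print(fill_gaps(track, "linear"))
print(fill_gaps(track, "previous"))
```

On this track, Linear produces a smooth ramp between the two valid frames, whereas Previous and Next repeat a neighboring value, which keeps the track flat across the gap and then jumps.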
5.6. Impact of Coordinate Change Rate Adjustment
The pose refinement method proposed in this paper identifies abnormal changes in the frame-by-frame coordinate change rate, that is, changes that are atypical of joint trajectories, and treats them as outliers to be removed. To understand the impact of this approach on performance improvement, we compared the coordinate change rate graphs before and after refinement.
Figure 9 displays the change rate graphs of each joint before and after refinement. The red graph represents the change rate before refinement, and the blue graph shows the change rate after refinement. Each graph starts with joint 0 at the top and ends with joint 15 at the bottom. We compared the change rates of Pose 3, which showed an increase in the MPJPE score, and Pose 9, which demonstrated the most significant reduction in the MPJPE score. The results reveal that Pose 3 still had many spikes in the change rate after refinement, indicating that outliers were missed in regions of abrupt change. In contrast, for Pose 9, the post-refinement change rate showed significant stabilization. Consistent with this, Pose 9 achieved a more substantial improvement in pose accuracy than Pose 3. These results confirm that stabilizing the change rate can contribute to enhancing the accuracy of coordinates.
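To make the notion of a change-rate spike concrete, the sketch below computes the frame-by-frame change of a single coordinate track and blanks out frames that jump too far from the last accepted frame. The fixed `threshold` rule is an assumption made for illustration only and stands in for the detection procedure of the proposed method, which is not reproduced here.

```python
def change_rate(track):
    """Absolute frame-to-frame change of one coordinate."""
    return [abs(b - a) for a, b in zip(track, track[1:])]

def remove_spikes(track, threshold):
    """Mark frames that jump more than `threshold` from the last kept frame as None.

    Illustrative stand-in for outlier detection; the first frame is assumed valid.
    """
    cleaned = [track[0]]
    last_good = track[0]
    for value in track[1:]:
        if abs(value - last_good) > threshold:
            cleaned.append(None)  # treated as an outlier, to be interpolated later
        else:
            cleaned.append(value)
            last_good = value
    return cleaned

track = [1.0, 1.1, 5.0, 1.3, 1.4]
print(change_rate(track))
print(remove_spikes(track, threshold=0.5))
```

A single bad frame produces two consecutive spikes in the change rate, one entering and one leaving the outlier, which is the pattern visible as sharp peaks in a change rate graph. A fixed threshold of this kind would also reject genuinely fast motion, so it should be read only as a demonstration of the idea.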
5.7. Swing Pose Analysis Results
Figure 10 presents the swing pose analysis results: (b) shows the similarity between the swings of a professional golfer and a user, computed with the similarity comparison method; (c) shows the influence of each feature on the pose with the lowest similarity; and (a) visualizes that pose. From the results in (b), we can see that during the backswing phase, the user’s swing motion has the least similarity to that of the professional. This similarity measure can help determine which movements should be corrected first and which ones are most similar. The influence of each joint feature during the backswing phase, which has the lowest similarity, is presented in (c). The influence is measured by removing each feature in turn and observing the change in similarity. A negative value implies that the feature contributes strongly to pose similarity and can be interpreted as being similar to the professional’s motion. Conversely, a positive value means that removing the feature improves the similarity, indicating that correction is necessary. The left shoulder angle, which has the highest positive value, is marked in (a) with a yellow circular marker and shows a noticeable difference from the actual motion of the professional. We can also observe positive values in the shoulder and hip angles, signifying a difference in these angles compared to the professional. From these results, we conclude that our Swing Embedding method is intuitive and effective in identifying real swing differences.
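The feature-removal analysis can be sketched as follows. The sketch assumes that each pose is embedded as a vector of named features and that similarity is measured with cosine similarity; both the similarity measure and the feature names are assumptions made for illustration, and the embedding used in the paper may differ. The influence of a feature is the similarity obtained without that feature minus the similarity obtained with all features.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def feature_influence(user, pro, names):
    """Change in similarity when each feature is removed in turn.

    Positive: similarity rises without the feature, so it needs correction.
    Negative: similarity falls without the feature, so it already matches well.
    """
    user, pro = list(user), list(pro)
    base = cosine_similarity(user, pro)
    influence = {}
    for i, name in enumerate(names):
        reduced_user = user[:i] + user[i + 1:]
        reduced_pro = pro[:i] + pro[i + 1:]
        influence[name] = cosine_similarity(reduced_user, reduced_pro) - base
    return influence

# Hypothetical feature names and values, for illustration only.
names = ["left_elbow", "right_knee", "left_shoulder"]
scores = feature_influence([1.0, 2.0, 3.0], [1.0, 2.0, 0.0], names)
worst = max(scores, key=scores.get)
print(worst, scores)
```

In this toy input the third feature differs most between the two vectors, so removing it raises the similarity and it receives the largest positive influence, which corresponds to the joint that would be highlighted for correction.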
6. Limitations
In our study, we have proposed a method to enhance golf swing analysis. However, there are inherent limitations that need to be addressed. One primary concern is the distinct separation between our outlier detection and removal process and the interpolation process. This separation means that if one module does not function optimally, it could potentially compromise the efficacy of the entire system. For instance, accurate outlier detection, when followed by a sub-optimal interpolation, might lead to results that are not as reliable as when no interpolation is used. An integrated approach that combines both outlier removal and interpolation in an end-to-end framework may be more aligned with the desired objectives.
Additionally, our system’s effectiveness is heavily reliant on the performance of the pose estimation model. If this model does not produce accurate results, it could significantly affect the quality of the guidance provided. Challenging conditions, such as low light or situations where the subject does not contrast well with the background, can decrease the precision of the joint coordinate detection. Such limitations can affect the user experience with the swing guide. Future work should focus on refining the system to capture swing motions effectively in diverse environments.
7. Conclusions
In this paper, we introduced a pose refinement methodology and a golf swing analysis system based on explainable swing embedding for self-training. Our approach to pose refinement uses the per-frame changes in coordinates to detect biased pose joints. Because this method relies on sequential coordinate information, it can be applied not only to 2D poses but also to 3D poses. Additionally, our findings demonstrate that human pose estimation results can be refined by reducing sharp changes in coordinates. Furthermore, we proposed a swing embedding method that uses geometric information extracted from the swing pose. Our embedding method can not only compare the similarity of two golf swing poses but also visualize where they differ, because the features of the embedding vector consist of intuitive information, such as the angle of the shoulder. Consequently, the case study showed that our swing guide system for self-training can appropriately suggest the specific body point that needs to be corrected to become more similar to the pro golfer’s swing. Our proposed system can be utilized in an application service for users who want to study the golf swing in a low-cost and time-efficient way.