4.1. Implementation Details
4.1.1. Dataset
The TotalCapture dataset [42] is one of the most popular datasets containing ground-truth 3D human poses together with synchronized IMU data and images. It was recorded with eight cameras and 13 IMUs. The dataset includes subjects of different genders performing four actions: range of motion (ROM), walking, acting, and freestyle, each repeated three times. In our experiments, for efficiency, we used four cameras (cameras 1, 3, 5, and 7) and eight IMUs (placed at the center positions of the limbs), as shown in Figure 1.
The TotalCapture dataset was divided into training and testing sets following [38]. The training set includes subjects 1, 2, and 3, encompassing the ROM 1, 2, and 3, Walk 1 and 3, and Freestyle 1 and 2 sequences. The testing set comprises Walk 2, Freestyle 3, and Acting 3 from all subjects. SimpleNet was trained on the training set, while model performance was assessed on the testing set; the orientation regularized network (ORN) and the orientation regularized pictorial structure model (ORPSM) were employed in this evaluation.
This paper validates the generalizability of the model on the Human3.6M dataset [43], which includes 3.6 million 3D human poses and corresponding images of 11 professional actors (six males and five females) in seven scenarios: discussion, smoking, taking photos, talking on the phone, daily activities, leisure activities, and social interaction. The dataset was captured using four calibrated, high-resolution 50 Hz cameras. For 2D pose estimation, this paper uses subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing. Since this dataset does not provide IMU data, we created virtual IMUs (limb orientations) from the ground-truth 3D poses for both training and testing, and we therefore only show proof-of-concept results.
4.1.2. Training Method
In the training of the 2D pose estimator, the adjustable hyperparameters include the learning rate, batch size, regularization parameter, network architecture parameters, optimizer, number of iterations, rotation factor, and scaling factor. The learning rate was set to 0.001 and reduced by a factor of 10 at epoch 25 and again at epoch 30. The study used an RTX 4090 GPU with 24 GB of VRAM and a 2 TB SSD. During the training of ResNet-152, the batch size was 16, with throughputs of 210 samples/s for SN + Lp, 59 samples/s for ORN+, and 11 samples/s for ORPSM+. The regularization parameter controls the strength of the regularization penalty; we used L2 regularization, and an appropriate setting improves the model's generalization ability and prevents overfitting. We used ResNet-152 as the backbone and the Adam optimizer to adjust the model's weights to minimize the loss function. The rotation factor and scaling factor were set to 45° and 0.5, respectively.
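As a concrete illustration, this setup can be expressed as a minimal PyTorch sketch; the backbone instantiation and the weight-decay value are placeholders, since the paper does not show its training script or state the regularization strength:

```python
import torch
import torchvision

# Placeholder backbone; the paper uses a SimpleNet-style head on ResNet-152.
model = torchvision.models.resnet152(weights=None)

# Hyperparameters from the text: lr = 0.001, Adam, L2 regularization
# (weight decay; the value 1e-4 is an assumption), 10x decay at epochs 25 and 30.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 30], gamma=0.1
)
```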
To optimize the learning rate of the model, this method searches the range from 0.1 to 0.0001. First, the minimum learning rate was initialized to $\eta_{\min} = 0.0001$ and the maximum to $\eta_{\max} = 0.1$. The model was trained with the intermediate value $\eta_{mid} = (\eta_{\min} + \eta_{\max})/2$, and the accuracy on the test set was recorded. The search range was then adjusted according to test-set performance: if the accuracy improved, the minimum learning rate was updated to $\eta_{mid}$; if it did not improve, the maximum learning rate was updated to $\eta_{mid}$. This process was repeated for 10 iterations to gradually narrow the gap between $\eta_{\min}$ and $\eta_{\max}$. Finally, the learning rate with the highest test-set accuracy was selected as the final value, namely 0.0001. This paper also adopts an adaptive learning rate adjustment strategy in which, at the 20th and 25th epochs, the learning rate is reduced by a factor of 10 when the test-set accuracy stabilizes or no longer improves, aiming to prevent model overfitting.
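A minimal sketch of this bisection-style search, assuming a hypothetical train_and_evaluate(lr) helper that trains the model at the given learning rate and returns test-set accuracy:

```python
def bisect_learning_rate(train_and_evaluate, lr_min=1e-4, lr_max=1e-1, iters=10):
    """Bisection-style learning-rate search described above.

    train_and_evaluate(lr) is a hypothetical helper that trains the model
    with learning rate lr and returns the resulting test-set accuracy.
    """
    best_lr, best_acc = lr_min, float("-inf")
    for _ in range(iters):
        lr_mid = (lr_min + lr_max) / 2.0      # intermediate value
        acc = train_and_evaluate(lr_mid)
        if acc > best_acc:                    # accuracy improved:
            best_lr, best_acc = lr_mid, acc
            lr_min = lr_mid                   # raise the lower bound
        else:                                 # no improvement:
            lr_max = lr_mid                   # lower the upper bound
    return best_lr
```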
To enhance the diversity of the training data, random image rotations were employed, allowing the model to better adapt to object poses at various angles. A rotation range of 45 degrees covers a wide angular variation, improving the model's robustness to rotated subjects; it also avoids issues such as limb misidentification (e.g., the left limb identified as the right) that can arise beyond a 45-degree rotation. Additionally, a scaling factor of 0.5 was applied during training to randomly resize images, which helps the model learn and adapt to subjects at different scales and ensures better performance on objects of varying sizes.
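A minimal sketch of such an augmentation step, assuming PIL images and torchvision; the sampling ranges are illustrative interpretations of the rotation and scaling factors above:

```python
import random
import torchvision.transforms.functional as TF

def augment(image, rotation_factor=45.0, scale_factor=0.5):
    """Randomly rotate within +/-rotation_factor degrees and rescale by a
    factor sampled from [1 - scale_factor, 1 + scale_factor]."""
    angle = random.uniform(-rotation_factor, rotation_factor)
    scale = random.uniform(1.0 - scale_factor, 1.0 + scale_factor)
    image = TF.rotate(image, angle)                      # random rotation
    new_size = [int(image.height * scale), int(image.width * scale)]
    return TF.resize(image, new_size)                    # random rescaling
```

In a real pose-estimation pipeline, the same rotation and scaling would also have to be applied to the keypoint annotations.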
4.1.3. Evaluation Criteria
In 2D human pose estimation, we used the percentage of correct keypoints (PCK) metric as the evaluation criterion. PCKh@t measures the percentage of estimated keypoints whose distance to the ground-truth keypoints is less than t times the head length. In the experiments, we provide results for different values of t, with the head length set to 300 mm; specifically, we used t = one-half, one-sixth, and one-twelfth and examined the corresponding percentages under these thresholds. For 3D human pose estimation, the mean per joint position error (MPJPE) is a classic metric that calculates the average distance error, measured in millimeters, between predicted keypoints and ground-truth keypoints. By reporting the average error over all predicted keypoints, we can evaluate the accuracy of our ORPSM+ in 3D pose estimation.
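Both metrics are straightforward to compute; a minimal NumPy sketch follows (the array shapes are assumptions for illustration):

```python
import numpy as np

def pckh(pred, gt, head_len, t=0.5):
    """PCKh@t: fraction of keypoints whose 2D error is below t * head length.

    pred, gt: (N, J, 2) arrays of predicted and ground-truth 2D keypoints;
    head_len: head length per sample (scalar or (N,) array).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)                      # (N, J) errors
    thresh = t * np.asarray(head_len, dtype=float).reshape(-1, 1)  # per-sample threshold
    return float((dist < thresh).mean())

def mpjpe(pred, gt):
    """MPJPE: mean Euclidean distance (mm) between predicted and
    ground-truth 3D joints; pred, gt: (N, J, 3) arrays in millimeters."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```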
4.2. Two-Dimensional Human Pose Estimation Experiment
Table 1 mainly presents the variants of SimpleNet (SN), including SN50 with a ResNet-50 baseline, SN152 with a ResNet-152 baseline, and SN152 + Lp, which adds the Laplace distribution function. Compared to SN50, the SN152 model has a slightly lower Mean (All) value at a threshold of one-half, a decrease of 0.1% (99.0% for SN152 vs. 99.1% for SN50). At a threshold of one-sixth, both SN152 and SN50 have the same Mean (All) value of 90.7%. At a threshold of one-twelfth, SN152 outperforms SN50, with a 0.4% higher Mean (All) value. Comparing the results across thresholds, the deeper SN152 architecture does not significantly affect the average joint recognition accuracy at thresholds of one-half and one-sixth, but it achieves higher accuracy at one-twelfth. We therefore chose the SN152 model for 2D human pose estimation on all images. When comparing SN152 + Lp with SN152, the former achieves higher average joint accuracy, with improvements of 0.2%, 1.3%, and 2.0% at thresholds of one-half, one-sixth, and one-twelfth, respectively. This indicates that incorporating the Laplace distribution improves overall accuracy, especially at smaller thresholds. We conclude that the SN152 model provides favorable 2D pose estimation performance and that further gains can be achieved by leveraging the Laplace distribution, particularly at lower thresholds.
Figure 5 plots three curves of the average joint accuracy at thresholds of one-half, one-sixth, and one-twelfth for the 2D ORN model under different values of w. At thresholds of one-half and one-sixth, the curves are relatively flat; at one-twelfth, the accuracy peaks for w between 0.4 and 0.7. Across the three thresholds, w = 0.6 and w = 0.7 give the highest accuracy. We adopt 0.7 as the optimal value, at which the accuracy is 99.5%, 93.8%, and 77.4% for thresholds of one-half, one-sixth, and one-twelfth, respectively.
According to Table 2, which reports the average accuracy (PCKh@t) over all joints for ORN and ORN+ at thresholds of one-half, one-sixth, and one-twelfth, the two models achieve the same average accuracy at one-half and one-sixth, while at one-twelfth ORN+ reaches 77.4%, exceeding ORN. It follows that using ORN+ instead of ORN makes the network's predictions smoother while improving accuracy at small thresholds.
In Table 3, SN152 + Lp refers to the 2D human pose model that uses ResNet-152 and the Laplacian distribution function, the single-view fusion model uses the IMUs to enhance the keypoints within the same view, and ORN+ uses the IMUs to enhance the keypoints across multiple views. ORN+ achieves higher accuracy (PCKh@t) than SN152 + Lp and the single-view fusion model under all thresholds: at one-half, its Mean (All) accuracy is 0.3% and 0% higher, respectively; at one-sixth, 1.8% and 0.5% higher; and at one-twelfth, 4.8% and 1.8% higher. Comparing the accuracy across thresholds, ORN+ performs best, and its advantage grows at smaller thresholds. At a threshold of one-half, the Elbow keypoint reaches its highest accuracy under the single-view fusion model, which is due to errors in the IMUs; the weighted averaging function mainly preserves the accuracy at the one-half threshold while improving the model's sensitivity at the smaller thresholds of one-sixth and one-twelfth.
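A minimal sketch of such a weighted average for one joint's heatmaps (illustrative; the construction of the IMU-derived heatmap and the exact fusion used in ORN+ are not shown here, and the names are placeholders):

```python
import numpy as np

def fuse_heatmaps(visual_heatmap, imu_heatmap, w=0.7):
    """Weighted average of an image-based heatmap and an IMU-derived
    heatmap for one joint; both are (H, W) likelihood maps, and w is the
    fusion weight (0.7 is the value selected in Figure 5)."""
    return w * visual_heatmap + (1.0 - w) * imu_heatmap
```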
As shown in Table 4, the models trained on the Human3.6M dataset were evaluated using the SimpleNet architecture with ResNet-152 (SN152) and with ResNet-152 plus the Laplace distribution (SN + Lp) at different thresholds (one-twelfth, one-sixth, and one-half). At thresholds of one-twelfth and one-sixth, SN + Lp achieved average accuracies of 72.7% and 92.2%, respectively, outperforming SN152. At the one-half threshold, the accuracy of SN + Lp was slightly lower, by 0.1%, than that of SN152; that is, under looser thresholds SN + Lp performs marginally worse, but the difference is minimal. Overall, SN + Lp shows the best performance, indicating that introducing the Laplace distribution positively impacts the model's performance on the Human3.6M dataset, particularly under the stricter thresholds (one-twelfth and one-sixth).
Table 5 presents the evaluation results of the models trained on the Human3.6M dataset. The SN152 + Lp model uses ResNet-152 as the backbone network and incorporates the Laplace distribution function; the single-view model applies the regularized network under a single-view setup, while the ORN+ model is evaluated under a multiview setup. Compared to SN152 + Lp and the single-view model, ORN+ achieved the highest average keypoint accuracy at all thresholds, with values of 97.2%, 94.9%, and 83.1%, respectively. Overall, the ORN+ model demonstrates superior performance in human pose estimation tasks, particularly in multiview scenarios, which has significant implications for improving the performance and robustness of pose estimation systems when handling multiview data.
Combining the results from Table 3, Table 4, and Table 5, it is evident that the SN + Lp and ORN+ models achieve the highest average keypoint accuracy at thresholds of one-half, one-sixth, and one-twelfth, thereby validating the generalization performance of the proposed model.
Figure 6 shows 2D human pose visualizations produced by our SN152 + Lp model and the ORN+ model across the four cameras.
4.3. Three-Dimensional Human Pose Estimation Experiment
We first evaluated our 3D pose estimator via extensive ablation experiments and then compared our method to state-of-the-art methods. As shown in Figure 7, we varied the weight w of the limb length constraint function in the ORPSM+ model from 0 to 1 in steps of 0.1 (11 values) and recorded the average 3D joint MPJPE on the TotalCapture dataset. Observing the output MPJPE under different values of w allows us to determine the optimal base expectation: the MPJPE reaches its minimum of 20.7 mm at w = 0.7, 0.8, and 0.9, and we finally adopt 0.7 as the base expectation. Setting w = 0 reduces ORPSM+ to the ORPSM; relative to w = 0, using w = 0.7 lowers the MPJPE by 0.7 mm.
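A minimal sketch of this sweep, assuming a hypothetical evaluate_mpjpe(w) helper that runs ORPSM+ with limb length weight w and returns the mean MPJPE in millimeters:

```python
import numpy as np

def sweep_limb_weight(evaluate_mpjpe):
    """Grid search over the limb length constraint weight w in [0, 1]."""
    ws = np.round(np.arange(0.0, 1.01, 0.1), 1)   # 11 values: 0.0, 0.1, ..., 1.0
    errors = {w: evaluate_mpjpe(w) for w in ws}   # MPJPE (mm) for each w
    best_w = min(errors, key=errors.get)          # w with the lowest error
    return best_w, errors
```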
According to Table 6, SN152 refers to the ResNet-152 network used under the SimpleNet (SN) baseline; SN152 + ORN is the fused 2D pose model that uses the SN152 baseline and combines multiple views and IMUs via ORN; SN152 + Lp + ORN uses the SN152 network together with the Laplacian probability distribution function and the ORN model; and SN152 + Lp + ORN+ additionally employs the weighted averaging function. ORPSM refers to the ORPSM+ model with base expectation w = 0 for 3D human pose estimation with the limb length and limb direction constraints (w = 0 disables the limb length term). With SN152 + Lp + ORN+ as the 2D model, performance is best: the error is 20.7 mm, reducing the Mean (All) error by 0.7 mm compared with ORPSM and the error of the Others keypoints by 0.3 mm. When using the ORPSM model for 3D estimation, the Others keypoints have the smallest 3D error under SN152, mainly because Others comprise the root, belly, neck, and nose keypoints, which the SN152 model already estimates with high 2D accuracy; fusing multiple views and IMUs through ORN introduces errors and reduces accuracy for these keypoints, especially in 3D. However, we use ORN mainly to improve the accuracy of the limb joints (Hip, Knee, Ankle, Shoulder, Elbow, and Wrist), which the SN152 model predicts less accurately, and using ORN and ORN+ does not significantly reduce the accuracy of the Others keypoints. Overall, with the 3D model held fixed, both the Mean (All) error and the 3D keypoint MPJPE are reduced when using ORN and ORN+.
According to Table 7, the overall average error (Mean) without IMU-based alignment processing is 20.7 mm, ranking second on the TotalCapture dataset. Compared to the baseline [15], the model used in this paper reduces the MPJPE by 3.9 mm. At the same time, our results for subjects 4 and 5 (S4,5) on the acting (A3) and freestyle (FS3) sequences rank first on the TotalCapture dataset. This demonstrates that our models, including the nonfused SN152, the fused ORN+ in 2D, and the ORPSM+ model in 3D, are effective. In our pipeline, alignment is used as a post-processing step in which the predicted keypoints are aligned to the ground-truth keypoints through rotation and translation to obtain the aligned keypoints.
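A minimal sketch of such rigid alignment (a Procrustes/Kabsch-style step; illustrative, not necessarily the paper's exact implementation):

```python
import numpy as np

def rigid_align(pred, gt):
    """Align predicted 3D joints (J, 3) to the ground truth (J, 3) using
    the optimal rotation and translation (Kabsch algorithm)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g                 # center both point sets
    U, _, Vt = np.linalg.svd(P.T @ G)             # SVD of cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflection
    R = (U @ np.diag([1.0, 1.0, d]) @ Vt).T       # optimal rotation
    return (R @ P.T).T + mu_g                     # rotate, then translate
```

MPJPE computed after this kind of alignment is commonly reported as PA-MPJPE.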