1. Introduction
The results of a worldwide survey of fitness trends for 2019 have shown that wearable technology will continue to be the number one trend [1]. This trend has been observed since 2016. However, many consumers are not aware that wearable devices sometimes provide inadequate and inaccurate estimates of the parameters they measure [2]. Nowadays, a wide range of consumer wearable devices is available. The term wearable devices, often abbreviated as wearables, usually refers to small computer-controlled systems that are worn in, on, or close to the body. They are often equipped with a variety of sensors (e.g., accelerometers, gyroscopes, magnetometers, pulse oximeters). These sensors enable the devices to collect information about their immediate environment and thereby to monitor physiological parameters such as number of steps, heart rate, quality of sleep, sleep rhythm, energy expenditure (EE), and maximum oxygen uptake (VO2max) [2,3,4,5,6,7,8,9,10,11,12,13,14]. In the rapidly growing field of wearables, tattoos and subcutaneous implants have also become a subject of research. However, the best-known and most common wearable devices are wrist-worn activity trackers. They represent one of the simplest and most cost-effective ways of monitoring various physiological parameters. Despite the high number of wrist-worn activity trackers [14], there is a lack of scientific validation studies. Energy expenditure seems to be the most critically examined physiological parameter. Evenson et al. [2] provide a systematic review of the validity and reliability of activity trackers. This review includes validation studies on activity trackers from the Fitbit and Jawbone brands. For EE, almost all trackers demonstrate good to excellent reliability (intraclass correlation coefficient, ICC = 0.74–0.97). However, they do not provide valid results: EE is usually significantly underestimated. Further studies confirm that many wrist-worn activity trackers show insufficient validity and reliability concerning EE [3,4,5,6,7,9,10,11,13,15,16]. Boudreaux et al. [16] investigated the validity of EE estimation during cycling and resistance exercise. None of the tested devices showed valid results. Besides the general validity, Wahl et al. [15] examined the influence of running pace on the estimated EE and found it to be significant: EE tends to be overestimated at lower pace and underestimated at higher pace. Similarly, Roos et al. [5] examined the validity of estimated EE during running. They concluded that metabolism significantly influences the estimation of EE. In the aerobic range, EE was both over- and underestimated, whereas in the anaerobic range, the tested sports watches significantly underestimated EE by 21.6% to 49.3%. Woodman et al. [3] also reported significant mean absolute percentage errors of up to 64% in EE. Only a small number of scientific validation studies on the VO2max estimation of wrist-worn activity trackers have been published so far. To our knowledge, Kraft & Roberts [8] and Snyder et al. [17] are the only scientific validation studies concerning the prediction of maximum oxygen uptake by means of activity trackers. Kraft & Roberts [8] tested the accuracy of the VO2max prediction of the Garmin Forerunner 920XT. The sports watch did not show significant differences in comparison with spirometry. Snyder et al. [17] investigated the validity of the VO2max prediction of three sports watches: the Polar V800 and the Garmin Forerunner 230 and 235. They showed significant differences in comparison with the gold standard and observed a significant influence of gender. Given the differing results of the above-mentioned validation studies, estimations of EE and VO2max should be regarded with skepticism and caution. In general, these publications and their results underline the necessity for comprehensive scientific validation of wearable devices.
Performance-specific misjudgments of activity trackers can lead to an increased risk of injury due to overload. Consumers must be protected, especially if activity trackers are to be used increasingly in the health sector and if they are granted increasing access to our society. However, this access can only be considered responsible if the current lack of transparency in the activity tracker industry is remedied through high-quality research, which can also help define general standards for these devices. Hence, interdisciplinary collaboration between different fields, especially sports science, medical technology, and ergonomics, but also standardization, is required.
The aim of the study was to clarify the validity of wrist-worn activity trackers. Both VO2max and EE are physiological parameters used for training control in sports and to support obesity treatment. They are therefore linked to both physical activity and a healthy lifestyle. The purpose was to assess whether the activity trackers represent a valid alternative to the respective gold-standard method in terms of predicting EE and VO2max and whether they can be used without hesitation in the health sector or for training control in sports.
3. Results
Characteristics of the sample population are shown in Table 1.
3.1. VO2max
The results of the descriptive examination of the differences between the measured VO2max (CM3B) and the estimated VO2max of the investigated sports watches GF920XT and PV800 are provided in Table 2.
On average, participants achieved a VO2max of 50.3 ± 8.1 ml·kg−1·min−1 in spirometry. The average estimated VO2max of the GF920XT and PV800 was 48.1 ± 6.5 ml·kg−1·min−1 and 53.2 ± 10.5 ml·kg−1·min−1, respectively. The t-tests comparing the VO2max from the sports watches with spirometry show a significant underestimation by the GF920XT (t(23) = –2.37, p = 0.027), whereas the overestimation by the PV800 is not significant (t(23) = 1.89, p = 0.071). Even though both devices are quite similar in terms of mean absolute error (MAE), the MAPE was 13.2% for the PV800 and 7.3% for the GF920XT. Moreover, the GF920XT and PV800 show good and moderate agreement with CM3B, respectively (ICC, GF920XT: 0.82; PV800: 0.67), and high variance in their estimates (GF920XT: 42.1; PV800: 109.6).
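The two error measures reported above can be sketched as follows. This is a minimal illustration, assuming the usual definitions of MAE and MAPE relative to the reference method; the function names and the sample values are hypothetical and are not data from this study.

```python
from statistics import mean


def mae(reference, estimate):
    """Mean absolute error, in the units of the measurement."""
    return mean(abs(e - r) for r, e in zip(reference, estimate))


def mape(reference, estimate):
    """Mean absolute percentage error relative to the reference, in percent."""
    return 100.0 * mean(abs(e - r) / r for r, e in zip(reference, estimate))


# Hypothetical VO2max values in ml/kg/min (NOT data from this study).
spirometry = [40.0, 50.0, 60.0]
watch = [44.0, 46.0, 63.0]

print(round(mae(spirometry, watch), 2))   # 3.67
print(round(mape(spirometry, watch), 2))  # 7.67
```

Because MAPE divides each error by the reference value, two devices with a similar MAE can still differ in MAPE when their larger errors fall on participants with lower reference values.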
Figure 1 shows Bland-Altman plots of the sports watches PV800 and GF920XT in comparison with spirometry with CM3B.
These plots visualize the scatter and the over- or underestimated measurement ranges of the investigated sports watches. They show the differences between the VO2max values on the y-axis against the mean of the two methods (spirometry and alternative method) on the x-axis. The mean difference (bias) between estimated VO2max and spirometry VO2max and the upper and lower limits of agreement (ULoA, LLoA) are labeled in the plots. Limits of agreement (LoA) were calculated as the mean difference ± 1.96 × SD of the differences. Both sports watches show considerable scatter when compared with spirometry. The plots illustrate the PV800’s tendency to overestimate (bias: 3.0 ml·kg−1·min−1) and the GF920XT’s tendency to underestimate (bias: –2.1 ml·kg−1·min−1) the VO2max. Furthermore, the differences in variance are visualized. The PV800 (ULoA-LLoA: 30.2 ml·kg−1·min−1) shows higher scatter than the GF920XT (ULoA-LLoA: 17.2 ml·kg−1·min−1).
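The bias and limits of agreement described above can be computed as in the following sketch. It assumes the standard Bland-Altman definition (mean difference ± 1.96 × sample SD of the differences); the function name and the sample values are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def bland_altman(reference, estimate):
    """Return (bias, lower LoA, upper LoA) for estimate minus reference."""
    diffs = [e - r for r, e in zip(reference, estimate)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD (n - 1 in the denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired values (NOT data from this study).
spirometry = [40.0, 50.0, 60.0, 45.0]
watch = [42.0, 49.0, 65.0, 44.0]

bias, lloa, uloa = bland_altman(spirometry, watch)
print(f"bias={bias:.2f}, LLoA={lloa:.2f}, ULoA={uloa:.2f}, width={uloa - lloa:.2f}")
```

Since the limits are symmetric about the bias, a ULoA-LLoA width corresponds to 2 × 1.96 × SD of the differences, so a wider interval directly reflects larger scatter of the device around the reference.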
3.2. Energy Expenditure
The results of the descriptive examination of the differences between the measured EE (CM3B) and the estimated EE of the investigated fitness trackers TTT, GVHR, and WPOx, with and without adjustment for the participants’ subjective assessment of physical strain, are provided in Table 3.
On average, participants achieved an EE of 125.5 ± 35.3 kcal in spirometry. The average estimated EE of TTT is 130.0 ± 23.2 kcal, of GVHR is 139.8 ± 28.8 kcal, of WPOx without adjustment is 121.8 ± 24.4 kcal, and of WPOx with adjustment is 121.5 ± 22.0 kcal, respectively. Based on the results of the t-test, the GVHR significantly overestimates the EE (t(23) = 2.44, p = 0.023). The fitness trackers TTT (t(23) = 0.93, p = 0.363), WPOx without adjustment (t(23) = –0.54, p = 0.590), and WPOx with adjustment (t(23) = 0.90, p = 0.377) indicated no significant tendency to underestimate or overestimate the EE.
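The t statistics quoted above come from a paired comparison of each device with spirometry. The sketch below shows how such a statistic is formed and checks the reported values against the two-sided 5% critical value for 23 degrees of freedom (approximately 2.069). The function name and the small sample are hypothetical; only the four t values in `reported_t` are taken from the text.

```python
import math
from statistics import mean, stdev


def paired_t(reference, estimate):
    """Paired-samples t statistic and degrees of freedom (estimate - reference)."""
    diffs = [e - r for r, e in zip(reference, estimate)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical paired EE values in kcal (NOT data from this study).
t_demo, df_demo = paired_t([40.0, 50.0, 60.0, 45.0], [42.0, 49.0, 65.0, 44.0])
print(f"t({df_demo}) = {t_demo:.2f}")

# t statistics as reported in the text, all with df = 23.
reported_t = {
    "TTT": 0.93,
    "GVHR": 2.44,
    "WPOx without adjustment": -0.54,
    "WPOx with adjustment": 0.90,
}
T_CRIT_DF23 = 2.069  # two-sided critical value for alpha = 0.05, df = 23
significant = [name for name, t in reported_t.items() if abs(t) > T_CRIT_DF23]
print(significant)  # ['GVHR']
```

A non-significant result here only means that the mean difference is not distinguishable from zero at this sample size; it says nothing about how widely individual estimates scatter around the reference.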
This results in a MAPE of 18.2% (TTT), 23.9% (GVHR), 20.1% (WPOx without adjustment), and 14.2% (WPOx with adjustment), although the MAEs of the fitness trackers, except for the GVHR, were very similar. Moreover, the fitness trackers show low to moderate agreement (ICC) with CM3B (TTT: 0.68; GVHR: 0.60; WPOx without adjustment: 0.40; WPOx with adjustment: 0.72) and exhibit high variance in their estimates (TTT: 538.0; GVHR: 831.9; WPOx without adjustment: 595.7; WPOx with adjustment: 482.8).
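The text does not state which form of the ICC was used. As an illustration only, the sketch below assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), computed from the mean squares of a two-way ANOVA. The function name and the sample values are hypothetical.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per participant and one column per method."""
    n = len(ratings)      # participants
    k = len(ratings[0])   # methods (e.g. spirometry and one tracker)
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-participant mean square
    ms_cols = ss_cols / (k - 1)                # between-method mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: the second method reads exactly 1 unit higher than the first.
demo = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(round(icc_2_1(demo), 3))
```

In this form, a constant offset between the two methods lowers the coefficient even when the methods are perfectly correlated, which is why an absolute-agreement ICC is a stricter criterion than a correlation coefficient.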
Figure 2 shows Bland-Altman plots of the tested fitness trackers in comparison with spirometry. These plots serve as a visual illustration of scattering and over- or underestimated measurement ranges of the investigated fitness trackers.
The plots show the differences between the EE values on the y-axis against the mean of the two methods (spirometry and alternative method) on the x-axis. The mean difference (bias) between estimated EE and spirometry EE and the upper and lower limits of agreement (ULoA, LLoA) are labeled in the plots. Limits of agreement (LoA) were calculated as the mean difference ± 1.96 × SD of the differences. The fitness trackers show considerable scatter when compared with spirometry. The plots illustrate the TTT’s and GVHR’s tendency to overestimate (bias TTT: 4.5 kcal; bias GVHR: 14.3 kcal) and the WPOx’s tendency to underestimate (bias WPOx without adjustment: –3.7 kcal; bias WPOx with adjustment: –4.0 kcal) the EE. Furthermore, the differences in variance are visualized. The WPOx with adjustment shows the lowest scatter (ULoA-LLoA: 86.1 kcal), whereas the WPOx without adjustment shows the highest scatter (ULoA-LLoA: 130.2 kcal). The GVHR (ULoA-LLoA: 112.3 kcal) and TTT (ULoA-LLoA: 94.1 kcal) lie between the WPOx with and without adjustment.
4. Discussion
The present study examined the validity of VO2max and EE estimations of various wrist-worn activity trackers.
The validity of the devices was determined by four methods. The MAPE was used to assess systematic differences. According to Nelson et al. [6], activity trackers should not exceed a 10% error deviation (MAPE) from the gold standard in order to be considered accurate. The GF920XT met this condition (7.3%). The PV800 (13.2%), WPOx with adjustment (14.2%), TTT (18.2%), WPOx without adjustment (20.1%), and GVHR (23.9%) exhibit greater errors. The t-tests comparing the estimated VO2max and EE from the activity trackers with spirometry indicated that the GF920XT significantly underestimates the VO2max and the GVHR significantly overestimates the EE. The other devices did not show any significant differences in comparison with the gold-standard method. To investigate the level of agreement between the activity trackers and the gold standard, Bland-Altman plots were prepared according to Bland & Altman [33]. Concerning the VO2max, the GF920XT shows narrower 95% limits of agreement than the PV800 (ULoA-LLoA (GF920XT): 17.2 ml·kg−1·min−1; ULoA-LLoA (PV800): 30.1 ml·kg−1·min−1), which reflects the differences in variance. The plots of EE revealed the narrowest 95% limits of agreement for the WPOx with adjustment (ULoA-LLoA: 86.2 kcal). The WPOx without adjustment shows the widest 95% limits of agreement, with a width of 130.2 kcal.
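The 10% MAPE criterion discussed above can be applied to the reported values in a few lines. In this sketch the MAPE figures are those quoted in the text; the variable names and the "(VO2max)" and "(EE)" labels are added here for illustration.

```python
MAPE_THRESHOLD = 10.0  # percent; criterion attributed to Nelson et al. [6] in the text

# MAPE values as reported in the text (percent).
reported_mape = {
    "GF920XT (VO2max)": 7.3,
    "PV800 (VO2max)": 13.2,
    "WPOx with adjustment (EE)": 14.2,
    "TTT (EE)": 18.2,
    "WPOx without adjustment (EE)": 20.1,
    "GVHR (EE)": 23.9,
}

# "Should not exceed 10%" is read here as MAPE <= 10.
within = [name for name, value in reported_mape.items() if value <= MAPE_THRESHOLD]
outside = [name for name, value in reported_mape.items() if value > MAPE_THRESHOLD]

print(within)   # ['GF920XT (VO2max)']
print(outside)
```

Meeting the MAPE criterion alone does not establish validity, which is why the discussion combines it with the t-test, the limits of agreement, and the ICC.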
To determine the level of agreement, the ICCs between the activity trackers and spirometry were examined. The sports watches demonstrate a good (GF920XT) and a moderate (PV800) level of agreement, respectively. The fitness trackers TTT and WPOx with adjustment indicate a moderate agreement with the gold standard. To summarize, one of the six activity trackers shows a good level of agreement (GF920XT), and three show only a moderate level (PV800, TTT, WPOx with adjustment). Concerning the validity of the activity trackers, the GF920XT and GVHR show significant deviations from the gold standard. Although the other devices indicate no significant differences, they still show considerable deviations in dispersion and measuring range, which should be included in the decision regarding their validity. A lower MAPE, narrower limits of agreement, and a better level of agreement of the WPOx with adjustment compared with the WPOx without adjustment indicate that an additional subjective estimation of the user’s physical strain mostly leads to more accurate data. The systematic differences (MAPE) and the range between the limits of agreement of the examined activity trackers are considered too substantial. Therefore, their use as an alternative to the gold-standard method is questionable.
Both tested sports watches exceed the absolute error value of 10%, which was suggested by Fokkema et al. [12]. Thus, they must be considered too inaccurate to be recommended without concern for their users’ health, whether in sports, health care, or rehabilitative applications. Even though both sports watches seem to be more likely to underestimate an individual’s maximum oxygen uptake, they still sometimes overestimate it considerably. This could lead to situations that are harmful to the user’s health, especially for less experienced users.
Only a small number of scientific validation studies on the VO2max estimation of wrist-worn activity trackers have been published so far. To our knowledge, Kraft & Roberts [8] and Snyder et al. [17] are the only scientific validation studies concerning the prediction of maximum oxygen uptake by means of activity trackers. Kraft & Roberts [8] tested the accuracy of the VO2max prediction of the Garmin Forerunner 920XT. The sports watch did not show significant differences in comparison with spirometry. Therefore, they judged the sports watch to be an accurate device for determining the maximum oxygen uptake. This contrasts with the results of the present study. Although the present study shows a MAPE ≤ 10% and a good agreement (ICC) in comparison with CM3B, the estimation of the VO2max by the Garmin Forerunner 920XT was not sufficiently valid. This is justified by the significant differences from spirometry and the high variance of the estimates. Kraft & Roberts [8] did not provide any power analysis of their statistics, nor did they calculate systematic differences (MAPE). The authors’ assessment regarding the validity of the Garmin Forerunner 920XT refers exclusively to the results of the paired samples t-test. Thus, their conclusions should be considered critically. Snyder et al. [17] investigated the validity of the VO2max prediction of three sports watches: the Polar V800 and the Garmin Forerunner 230 and 235. They showed significant differences in comparison with the gold standard. Thus, the authors concluded that these sports watches should be used carefully for exercise prescription. This corresponds to the overall tendency of the present study. However, in detail, the present study shows no significant differences between the VO2max determinations of the PV800 and spirometry. Nevertheless, given a MAPE > 10% and a moderate agreement (ICC) in comparison with CM3B, the PV800 was considered too inaccurate. Moreover, in Snyder et al. [17], the maximum oxygen uptake estimations differed for men and women. In females, the Polar V800 significantly overestimated the VO2max, whereas in males, it significantly underestimated the VO2max. Differences between men and women were not considered in the present study. However, one weakness of the study of Snyder et al. [17] is the missing analysis of the MAPE values and the missing power analysis. Leboeuf et al. [34] examined the accuracy of maximum oxygen uptake prediction of an in-ear sensor. The analysis of systematic differences between the sensor and the gold standard shows a MAPE of 3.2 ± 7.3%. Therefore, the authors concluded that the in-ear sensor is accurate, despite insufficient statistical confirmation.
Fitness trackers are not suitable for EE estimation in the health sector, e.g., to support obesity treatment. Measured EE values can be used as a rough assessment. If exact values are needed, indirect calorimetry should be preferred. Evenson et al. [2] provide a systematic review of the validity and reliability of activity trackers. This review includes validation studies on activity trackers from the Fitbit and Jawbone brands. For EE, almost all trackers demonstrate good to excellent reliability (ICC = 0.74–0.97). However, they do not provide valid results: EE is usually significantly underestimated. Further studies confirm that many wrist-worn activity trackers show insufficient validity and reliability concerning EE [3,4,5,6,7,9,10,11,13,15,16]. Boudreaux et al. [16] investigated the validity of EE estimation during cycling and resistance exercise. Among other devices, they examined the accuracy of the Garmin Vivosmart HR (GVHR) and TomTom Touch (TTT). During resistance exercise, the EE estimates from both fitness trackers had weak intraclass correlation. The GVHR showed the strongest correlation (R = 0.18), whereas the TTT indicated the lowest correlation value (R = 0.02). Additionally, both devices had high MAPE values (GVHR: 57.02%; TTT: 51.64%) during resistance exercise. During cycling, both devices showed high MAPE values (GVHR: 63.05%; TTT: 41.27%) and had weak correlation (GVHR: R = 0.41; TTT: R = 0.30). Based on the t-test, the MAPE, and the ICC, the authors concluded that neither the GVHR nor the TTT represents a valid alternative to the metabolic analyser as the gold-standard method. This conclusion is in accordance with the results of the present study. However, in detail, there are some differences. In the present study, the MAPEs of the TTT and GVHR are considerably lower (TTT: 18.2%; GVHR: 23.9%) than in Boudreaux et al. [16]. In addition, the t-tests indicate different results. In the present study, the TTT does not significantly differ from CM3B, whereas in Boudreaux et al. [16], the TTT significantly overestimated EE during resistance exercise as well as during graded exercise cycling. In the present study, only the GVHR significantly overestimated EE compared with CM3B. Regarding resistance exercise, this is in line with the results of Boudreaux et al. [16]. Because of different study designs, the results of the present study and the study of Boudreaux et al. [16] are not fully comparable.
In general, these results substantiate the conclusions of Evenson et al. [2], indicating a low validity for EE estimation in 10 adult studies. Besides the general validity, Wahl et al. [15] examined the influence of running pace on the estimated EE and found it to be significant: EE tends to be overestimated at lower pace and underestimated at higher pace. To summarize, the authors concluded that most of the tested activity trackers could not be considered valid. Similarly, Roos et al. [5] examined the validity of estimated EE during running. They concluded that metabolism significantly influences the estimation of EE. In the aerobic range, EE was both over- and underestimated, whereas in the anaerobic range the tested sports watches significantly underestimated EE by 21.6% to 49.3%. The results of Woodman et al. [3] regarding the accuracy of activity trackers for estimating EE are in accordance with the above-mentioned studies [2,15,16]. Almost all of the tested activity trackers showed significant differences from the measured EE. Woodman et al. [3] also tested the accuracy of EE estimation of the Withings Pulse Ox. The MAPE was 64%. The MAPE of the WPOx found in the present study was considerably lower, at 14.2% (with adjustment) and 20.1% (without adjustment). This discrepancy may be caused by a different study design. In contrast to the present study, Lee et al. [7] examined the validity of EE estimated from a variety of consumer-based activity trackers under free-living conditions. However, the conclusions regarding validity do not differ: in that study, too, the activity trackers did not show sufficient validity. All tested activity trackers had a mean absolute percentage error of ≥10% compared with the gold-standard method, except the BodyMedia FIT. Given the differing results of the above-mentioned validation studies, estimations of EE and VO2max should be regarded with skepticism and caution.
There were also some limitations to this study. Based on the paired samples t-test, it cannot be conclusively stated whether the activity trackers are valid, because the power analysis indicated insufficient statistical power (i.e., the beta error exceeded the accepted level). To reach power levels > 0.8, a sample size of > 30 is recommended. Although the estimations were mostly not significantly different from measured EE and VO2max, the activity trackers still show considerable deviations in dispersion and measuring range. This can possibly be explained by a low to moderate effect size. Moreover, the reliability of the tested fitness trackers and sports watches should be investigated by performing repeated sets of measurements. Another limitation of the present study is that it was performed under controlled laboratory conditions. Thus, the results can be transferred to everyday life only to a limited extent. For clarification, it would be useful to conduct a broad field study.
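The relation between power, effect size, and sample size mentioned above can be illustrated with a rough calculation. This sketch uses the normal approximation for a two-sided paired test, n ≈ ((z(1−α/2) + z(power)) / d)², and is not the power analysis performed in this study. The function name and the effect sizes are hypothetical, since the text does not report the effect size.

```python
import math
from statistics import NormalDist


def paired_sample_size(effect_size, alpha=0.05, power=0.8):
    """Approximate number of pairs needed (normal approximation, two-sided test)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical standardized effect sizes (Cohen's d of the paired differences).
for d in (0.3, 0.5, 0.8):
    print(d, paired_sample_size(d))
```

Under these assumptions, a moderate effect size of 0.5 requires about 32 pairs for a power of 0.8, which is of the same order as the sample size of > 30 recommended above; an exact calculation based on the t distribution gives slightly larger numbers.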