1. Introduction
Physical activity patterns correlate with a person's health status and can be used to obtain information about the individual's health profile. For example, people recovering from major surgery may move less than is typical for them; the duration of the changed activity patterns can provide information about the patient's recovery trajectory [1]. Similarly, an increase in purposeless movement (e.g., pacing and an inability to sit still) may be a symptom of depression [2]. Thus, it may be beneficial to monitor the relevant daily activities of people at risk of developing health conditions. This kind of information has traditionally been gathered by having patients take surveys, complete interviews, or keep diaries [3]. Although these self-reports, as firsthand accounts, are valuable, they also have limitations [4]. For example, self-reported data are often questioned because they are prone to bias: patients may downplay certain tendencies because they want to be viewed as “normal”, and self-reports may overestimate exercise levels to project a “good social image” [5]. Patients can also provide inaccurate reports unintentionally because human memory is prone to error [6].
Researchers have sought new, objective ways of collecting more reliable physical activity data to complement self-reported data. The rise in smartphone adoption and usage offers a unique opportunity to revolutionize patient health status monitoring in research settings and clinical practice. Built-in smartphone sensors, such as the GPS, accelerometer, gyroscope, and magnetometer, can track location and movement continuously and unobtrusively. These in situ data can be collected to objectively quantify daily activities. Smartphone data collection does not require outfitting patients with additional instruments and, thus, can be conducted over long periods of time [7]. Smartphones are also widely accessible to the population: based on surveys by the Pew Research Center, as of 2021, about 85% of U.S. adults owned smartphones, almost 2.5 times the share from 10 years earlier [8]. The field of digital phenotyping has emerged to take advantage of this technological breakthrough and the vast amount of smartphone sensor data. Digital phenotyping is defined as the “moment-by-moment quantification of the individual-level human phenotype in situ using data from smartphones and other personal digital devices” [9]. This approach uses smartphones to capture high-throughput data to learn about cognitive, behavioral, and social phenotypes in free-living settings.
Human activity recognition (HAR) using smartphones has proliferated in recent years [10]. The first component of HAR is data collection, which requires careful thought about various questions, such as choosing the appropriate sensors, sampling frequency, study environment, and smartphone placement. Some studies use a single sensor [11,12], while other studies utilize multiple sensors simultaneously [13,14,15,16,17,18]. In our study, we used data collected from two sensors in the smartphone—the accelerometer and the gyroscope.
The second component of HAR is data analysis. With improvements in the technology, cost, and quality of data collection, the main challenge in HAR is shifting to data analysis, i.e., extracting the activities from the sensor data accurately and robustly [7,10,19]. In general, a given data analysis procedure can be divided into three steps: preprocessing, feature extraction, and activity classification [10]. Preprocessing prepares the data for the analysis at hand; for example, it might include the removal of irrelevant high-frequency fluctuations (noise). The feature extraction step involves selecting and extracting representative features from the data. In activity classification, the extracted features are first associated with physical states or physical activities using statistical models; these models are then used to classify activities in new data.
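To make the preprocessing step concrete, the following R sketch (our own illustration, not code from the paper's repository) damps high-frequency noise in a simulated tri-axial signal using a simple centered moving average; the 10 Hz rate matches our study, but the window length k = 5 is an arbitrary choice for illustration.

```r
# Illustrative preprocessing: smooth a simulated tri-axial accelerometer
# signal with a centered moving average to damp high-frequency noise.
set.seed(1)
tm <- seq(0, 5, by = 0.1)  # 5 s of timestamps sampled at 10 Hz
acc <- data.frame(
  x = sin(2 * pi * tm) + rnorm(length(tm), sd = 0.2),
  y = cos(2 * pi * tm) + rnorm(length(tm), sd = 0.2),
  z = 1 + rnorm(length(tm), sd = 0.2)
)
smooth_ma <- function(v, k = 5) {
  # Centered moving average; the first and last (k - 1)/2 values are NA.
  as.numeric(stats::filter(v, rep(1 / k, k), sides = 2))
}
acc_smooth <- as.data.frame(lapply(acc, smooth_ma))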
Previous HAR studies have used a variety of feature extraction and activity classification techniques. A rapidly developing approach is deep learning, which automates both feature extraction and activity classification. Using multiple layers in the network, a deep learning procedure identifies optimal features from the raw data itself, without human intervention [20]. Some studies show that this approach can yield highly accurate activity classifications [21,22,23]. However, it also has limitations and challenges. First, a vast amount of data is required to train a deep learning algorithm. Second, the model is usually used as a black box, and the features extracted by the multi-layered procedure can be difficult to interpret [20], which makes the algorithm difficult to improve.
A more traditional approach is to view the data in short segments, referred to as windows. This approach allows us to examine the data directly and choose which features to extract and by which methods. A model can then be constructed from training data to connect the selected features to activities. In this paper, we adopted the “movelet method” for feature extraction and activity classification, which was developed by Bai et al. [24] and later augmented by Huang and Onnela [11]. The movelet method is tailored to each individual patient: it constructs a personal dictionary of windows for different types of activities from her/his training data, and the patient's activities are then inferred by comparing new data with the data in the dictionary [24]. The unique advantages of the movelet method are that it is intuitive, transparent, and personalized to each individual patient. In comparison to more sophisticated machine learning methods, the movelet method requires only a small amount of training data (a few seconds per activity).
Some previous studies have used the movelet method to classify activities with a single sensor [11,24]. Bai et al. [24] analyzed data collected by a body-worn accelerometer. Huang and Onnela [11] applied the method to smartphone accelerometer data and, separately, to smartphone gyroscope data; the results showed that the smartphone accelerometer and gyroscope each had strengths in picking up different activities. In this study, we analyzed smartphone accelerometer data and gyroscope data jointly. Our hypothesis was that combining information on acceleration and angular velocity would improve classification accuracy because the individual sensors capture different aspects of movement. A previous study by He et al. [25] used multiple accelerometers fixed to different parts of the body and found improvements in classification accuracy from integrating the information from the multiple instruments. Although the smartphone differs from body-worn instruments, we expected its multiple sensors to provide similar benefits in classification accuracy. In comparison to multiple body-worn instruments, the smartphone has the advantage of being compact, convenient to carry, and usable over long time periods. In this paper, we present an extended version of the original movelet method that jointly incorporates smartphone accelerometer and gyroscope data. Moreover, we apply the method to our recent study and discuss the results. Our R code is provided on GitHub.
The paper is organized as follows. Section 2 describes the data set and presents our method for incorporating accelerometer and gyroscope data jointly in the movelet method. In Section 3, we present the results of applying this method to the study data set; we also compare the results to those from applying the movelet method to accelerometer data only and to gyroscope data only. Section 4 summarizes the results and discusses potential areas of future research.
2. Materials and Methods
2.1. Study Data Set
The data set used in this paper is from a study we conducted in 2018. The study included four participants: two female and two male, ranging in age from 27 to 54. Characteristics of the participants, including sex, height, weight, and dominant hand, are provided in Table S1 of Huang and Onnela [11]. For full disclosure, participant 1 is an author of this paper. Each participant had a study visit in which she/he performed a series of activities while wearing one study iPhone in the front right pants pocket and another study iPhone in the back right pants pocket. Throughout this paper, we focus on the front pocket phone and refer to it as “the phone”.
In our study, we collected data from both the accelerometer and gyroscope sensors in the smartphone, with the phone placed in the front pants pocket. An accelerometer measures the acceleration of a phone along each of three orthogonal axes of a Cartesian coordinate system: the x-axis and y-axis lie in the plane of the phone's screen, with x pointing right and y pointing to the top of the phone, and the z-axis points up through the screen, following the right-hand rule. A gyroscope measures the angular velocity of a phone about three orthogonal axes. In previous HAR studies, a variety of sampling frequencies (samples per second) have been used (e.g., 1 Hz or even 100 Hz), most commonly ranging between 20 and 30 Hz [10]. In our study, we sampled accelerometer and gyroscope data at a frequency of 10 Hz (i.e., 10 samples per second). The sampling frequency of 10 Hz was chosen because it is sufficient for capturing most daily activities.
Participants were observed separately. For each participant, the study visit consisted of two phases: training data collection and test data collection. During the training data collection, accelerometer and gyroscope data were recorded as the participant performed designated activities. These activities included walking, standing, ascending stairs, descending stairs, sitting, transitioning from sitting to standing (sit-to-stand), and transitioning from standing to sitting (stand-to-sit). In our analysis for each participant, we used 5 s of training data per activity for walking, standing, sitting, ascending stairs, and descending stairs. The training data for stand-to-sit used in the analysis came from one transition from standing to sitting; analogously, the training data for sit-to-stand came from one transition from sitting to standing. The durations of the training data for sit-to-stand and stand-to-sit were each shorter than 5 s because these activities are momentary transitions. The full protocol for the training data collection is provided in Huang and Onnela [11] (see Table 1 of their paper).
The test data collection included six steps, in which the participant followed a prescribed course of activities on the Harvard Longwood campus. For example, the course in step 1 included walking, ascending stairs, standing, and descending stairs. The participant walked at different speeds in step 3, and ascended and descended a long staircase in step 6. A complete description of steps 1–6 of the test data collection is provided in Huang and Onnela [11] (see Table 1 of their paper). The test data were collected in public spaces outdoors and indoors, not in a tightly controlled lab environment; we chose these public spaces to collect data in an unconstrained environment. In this paper, we use the test data from steps 1, 2, 3, 5, and 6 in our analysis. The test data from step 4 are not analyzed here: during step 4, the participant repeated the same course four times with the phone reoriented in a different position each time. We discuss the issue of how the phone is carried in Section 4.
Each participant was filmed throughout the experiment using a handheld camera. The video footage was used to manually annotate the sensor data with ground truth activity labels. The accelerometer and gyroscope measurements from the smartphone, along with the annotated activity labels from the video footage, are publicly available on Zenodo [26].
2.2. The Movelet Method: Single Sensor
The movelet method, proposed by Bai et al. [24], was originally designed for activity recognition from a body-worn tri-axial accelerometer, but it can be applied to any single tri-axial sensor. In our previous paper, we applied the movelet method separately to smartphone accelerometer data and smartphone gyroscope data [11]. The method of Bai et al. [24] proceeds as follows.
The movelet method uses pattern matching. Consider a single tri-axial sensor (e.g., a smartphone accelerometer). For a given participant, let $Y(t) = (x(t), y(t), z(t))$ denote the vector of x, y, and z measurements taken by the sensor at time t for the participant. We will assume that the sampling frequency of the sensor is 10 Hz. A movelet is defined as a 1-s window of the sensor's data. Let $M(t)$ denote the movelet beginning at time t. Then we have

$$M(t) = \big( Y(t),\, Y(t + 0.1),\, \ldots,\, Y(t + 0.9) \big), \quad (1)$$

where the time t is in units of seconds. Thus, the movelet $M(t)$ consists of the time series from all three axes (x, y, z) of the sensor within one second. The time increments are spaced by 0.1 s because this is the reciprocal of the sampling frequency of 10 Hz. We set the movelet duration to 1 s based on the existing literature: Bai et al. [24] found 1 s to be an appropriate choice because a 1-s window strikes a balance between being long enough to differentiate activities, yet short enough to avoid encapsulating multiple activities.
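As a minimal illustration of Equation (1), the R sketch below extracts the 1-s movelet beginning at a given sample index from a tri-axial time series stored as a matrix; the function and variable names are ours, not from the released study code.

```r
# Extract the 1-s movelet M(t): a 10-sample window (at 10 Hz sampling)
# across all three axes, starting at row index i.
get_movelet <- function(data_xyz, i, win = 10) {
  # data_xyz: n-by-3 matrix with columns x, y, z
  data_xyz[i:(i + win - 1), , drop = FALSE]
}
```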
Before applying the movelet method, a dictionary is constructed from a participant's training data. In the dictionary, the movelets derived from the training smartphone data are grouped into categories of different activities, based on the ground truth activity labels. In applying the movelet method, new movelets are constructed from new smartphone data. Each new movelet is compared to the movelets in the dictionary, and the most similar dictionary movelet is used to classify its activity. To distinguish between dictionary movelets and new movelets, define $\mathcal{T}_{\mathrm{dict}}$ to be the set of times t during training data collection and $\mathcal{T}_{\mathrm{new}}$ to be the set of times t during new data collection. Any given dictionary movelet has time $t \in \mathcal{T}_{\mathrm{dict}}$, while any given new movelet has time $t \in \mathcal{T}_{\mathrm{new}}$. The process of obtaining and comparing dictionary movelets and new movelets is described in the following paragraphs.
First, the study investigator makes a comprehensive list of daily life activities. Let A denote the number of activities in the list, which consists of activity 1, activity 2, up through activity A (e.g., walk, sit, stand). Training data are gathered by having the participant perform each of the activities while collecting data from the sensor of interest (e.g., a smartphone accelerometer). These training data are then used to build the dictionary for this participant. Each of the A activity entries in the dictionary is composed of multiple movelets, where any given movelet is a 1-s window of the sensor's tri-axial (x, y, z) data. For each of the A activity entries, the collection of dictionary movelets is obtained using a sliding-window process, as described in the following example. Suppose we have a 5-s segment of training data for a given activity. One obtains the 1-s dictionary movelets for the activity entry by sliding a 1-s window forward one sample (0.1 s, the reciprocal of the sampling frequency) at a time along the tri-axial data, until the right end of the 1-s window meets the last point of the 5-s time series. The 5-s time series contains 50 samples and each window spans 10 samples, so this yields 50 − 10 + 1 = 41 movelets. In general, the number of dictionary movelets in an entry depends on the sampling frequency and the duration of the training data for the activity. In summary, every dictionary movelet $M(t)$, $t \in \mathcal{T}_{\mathrm{dict}}$, is linked to a particular activity entry in the list of A activities.
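The sliding-window construction can be sketched in R as follows, assuming the training data for one activity are stored as an n-by-3 matrix of (x, y, z) samples (all names are hypothetical):

```r
# Build dictionary movelets for one activity by sliding a 1-s
# (10-sample) window forward one sample at a time.
build_entry <- function(train_xyz, win = 10) {
  n <- nrow(train_xyz)
  starts <- seq_len(n - win + 1)  # 50 samples -> 41 window starts
  lapply(starts, function(i) train_xyz[i:(i + win - 1), , drop = FALSE])
}
# dictionary <- lapply(training_by_activity, build_entry)
```

With 50 rows of input, build_entry returns the 41 movelets described above.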
Next, we perform activity classifications on new data (termed test data here) using the dictionary. For the test data, we also construct movelets by sliding a 1-s window forward, one sample (0.1 s) at a time, along the test data time series. Each test movelet is then matched with the dictionary movelet of smallest discrepancy. Precisely, for a given test movelet $M(t^{*})$, $t^{*} \in \mathcal{T}_{\mathrm{new}}$, we find

$$\hat{t} = \operatorname*{arg\,min}_{t \in \mathcal{T}_{\mathrm{dict}}} d_1\big( M(t), M(t^{*}) \big). \quad (2)$$

The function $d_1$ is a discrepancy metric based on Euclidean distance that will be defined in Section 2.3 [24]. Intuitively, Equation (2) finds the dictionary movelet $M(\hat{t})$ with the lowest discrepancy from the test movelet $M(t^{*})$. The activity label of the dictionary movelet $M(\hat{t})$ is then assigned to the test movelet $M(t^{*})$ as its classification. To classify the activity at a given time point t, one uses the test movelet beginning at time t and the nine subsequent movelets. A majority vote is taken among these ten movelets, and the activity that receives the most votes is taken as the classification for the time point t. The rationale behind the majority vote is that the later movelets also contain information about the activity at time t because human activities are continuous.
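A minimal R sketch of the matching and majority-vote steps, assuming the dictionary is stored as a list of movelet matrices with a parallel vector of activity labels, and that disc_fun is a discrepancy function such as the $d_1$ defined in Section 2.3 (all names are ours):

```r
# Classify one test movelet: find the dictionary movelet with the
# smallest discrepancy and inherit its activity label.
classify_movelet <- function(test_mov, dict_movelets, dict_labels, disc_fun) {
  d <- vapply(dict_movelets, function(m) disc_fun(m, test_mov), numeric(1))
  dict_labels[which.min(d)]
}

# Majority vote over the labels of the movelet at time t and the
# nine movelets that follow it (ties resolved by first occurrence).
classify_time <- function(vote_labels) {
  names(which.max(table(vote_labels)))
}
```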
An advantage of the movelet method is its small training data requirement. As demonstrated by Bai et al. [24], only a few seconds of training data are required per activity. More sophisticated machine learning methods can be applied to link the windows of data to activities [12,18], but these require more training data [24].
2.3. Discrepancy Metric
In the single-sensor method, a discrepancy metric $d_1$ is used to compare each test movelet to each dictionary movelet [24]. The discrepancy $d_1$ is defined as follows. Consider a dictionary movelet $M(t)$ and a test movelet $M(t^{*})$. For simplicity, we drop the t and $t^{*}$ and refer to these movelets as M and $M^{*}$, respectively. For the dictionary movelet M, let $\mathbf{x} = (x_1, \ldots, x_n)$ be the vector of length n containing the time series data for the x-axis of the single sensor during the 1-s window, and define the vectors $\mathbf{y} = (y_1, \ldots, y_n)$ and $\mathbf{z} = (z_1, \ldots, z_n)$ analogously, where n = 10. The subscripts 1 through n represent the different times in the 1-s window. For the test movelet $M^{*}$, we analogously define $\mathbf{x}^{*}$, $\mathbf{y}^{*}$, and $\mathbf{z}^{*}$. Using $\mathbf{u}$ and $\mathbf{u}^{*}$ to represent a pair of vectors of a given dimension from the dictionary movelet M and the test movelet $M^{*}$, respectively, the Euclidean distance for the dimension is defined as

$$L(\mathbf{u}, \mathbf{u}^{*}) = \sqrt{ \sum_{i=1}^{n} (u_i - u_i^{*})^2 }. \quad (3)$$

In the discrepancy metric, the Euclidean distance is computed for each of the x, y, and z axes, and these three distances are averaged together. Thus, the discrepancy $d_1$ between the two movelets M and $M^{*}$ is:

$$d_1(M, M^{*}) = \frac{1}{3} \big[ L(\mathbf{x}, \mathbf{x}^{*}) + L(\mathbf{y}, \mathbf{y}^{*}) + L(\mathbf{z}, \mathbf{z}^{*}) \big]. \quad (4)$$
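Equations (3) and (4) translate directly into a few lines of R; this sketch (our own illustration) assumes each movelet is stored as a 10-by-3 matrix with one column per axis:

```r
# Discrepancy d1: Euclidean distance per axis, averaged over x, y, z.
d1 <- function(m, m_star) {
  # m, m_star: 10-by-3 matrices (columns x, y, z), one movelet each.
  # colSums((m - m_star)^2) gives the per-axis sum of squared
  # differences; sqrt gives the per-axis Euclidean distance.
  mean(sqrt(colSums((m - m_star)^2)))
}
```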
2.4. The Movelet Method: Joint Sensors
In this paper, we propose an extension of the movelet method that uses gyroscope and accelerometer data simultaneously. The motivation is our hypothesis that acceleration and angular velocity provide different physical information, so combining both could improve the accuracy of the activity classifications. Related work includes He et al. [25], who applied the movelet method to data from multiple accelerometers fixed to different parts of the body. The joint-sensor method using accelerometer and gyroscope data follows the same procedure as in Section 2.2 and Section 2.3, except for the key differences described below.
In the joint-sensor method, the 1-s movelets include all six dimensions of data (x, y, z from the accelerometer and x, y, z from the gyroscope) rather than only three dimensions from a single sensor. These are multistream movelets because they combine the accelerometer and gyroscope data streams. The multistream movelets still follow Equation (1), except that the data $Y(t)$ at any given time t are now a vector of six values:

$$Y(t) = \big( x^{(a)}(t),\, y^{(a)}(t),\, z^{(a)}(t),\, x^{(g)}(t),\, y^{(g)}(t),\, z^{(g)}(t) \big), \quad (5)$$

where $x^{(a)}$, $y^{(a)}$, and $z^{(a)}$ correspond to the accelerometer and $x^{(g)}$, $y^{(g)}$, and $z^{(g)}$ correspond to the gyroscope. In our data collection, the sampling frequencies of the accelerometer and gyroscope were both 10 Hz; however, the measurements from the two sensors were not synchronized. Thus, we required data preprocessing to synchronize the accelerometer and gyroscope data before implementing the joint-sensor method. In our data preprocessing, we linearly interpolated the gyroscope data to the timestamps of the accelerometer data, for both the training data and the test data. Thus, all dictionary and test movelets in the joint-sensor analysis had six measurements at every accelerometer timestamp. We chose linear interpolation because it is simple and has been recommended for this type of data [27].
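For the synchronization step, a minimal sketch using base R's stats::approx for the linear interpolation; the column names and function name are our assumptions, not the repository code:

```r
# Synchronize sensors: linearly interpolate each gyroscope axis to
# the accelerometer timestamps. Accelerometer timestamps outside the
# gyroscope's time range yield NA under approx()'s default rule.
sync_gyro <- function(gyro_time, gyro_xyz, acc_time) {
  sapply(c("x", "y", "z"), function(ax) {
    stats::approx(x = gyro_time, y = gyro_xyz[, ax], xout = acc_time)$y
  })
}
```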
Compared to the single-sensor method, the joint-sensor method also uses a different discrepancy metric, $d_2$, to compare the multistream movelets. The metric $d_2$ is defined as follows. As in Section 2.3, we use M and $M^{*}$ to represent a dictionary movelet and a test movelet, respectively. Note that the dictionary movelet M now has corresponding data vectors $\mathbf{x}^{(a)}$, $\mathbf{y}^{(a)}$, and $\mathbf{z}^{(a)}$ for the accelerometer and $\mathbf{x}^{(g)}$, $\mathbf{y}^{(g)}$, and $\mathbf{z}^{(g)}$ for the gyroscope. Analogously, the test movelet $M^{*}$ has corresponding data vectors $\mathbf{x}^{*(a)}$, $\mathbf{y}^{*(a)}$, and $\mathbf{z}^{*(a)}$ for the accelerometer and $\mathbf{x}^{*(g)}$, $\mathbf{y}^{*(g)}$, and $\mathbf{z}^{*(g)}$ for the gyroscope. Moreover, as described above, the gyroscope vectors are obtained by interpolating the original gyroscope data to the accelerometer timestamps, so that the two data sources are synchronized.

The discrepancy metric $d_2$ for comparing M and $M^{*}$ is defined as:

$$d_2(M, M^{*}) = \frac{1}{6} \big[ L(\mathbf{x}^{(a)}, \mathbf{x}^{*(a)}) + L(\mathbf{y}^{(a)}, \mathbf{y}^{*(a)}) + L(\mathbf{z}^{(a)}, \mathbf{z}^{*(a)}) + L(\mathbf{x}^{(g)}, \mathbf{x}^{*(g)}) + L(\mathbf{y}^{(g)}, \mathbf{y}^{*(g)}) + L(\mathbf{z}^{(g)}, \mathbf{z}^{*(g)}) \big]. \quad (6)$$

Thus, for the joint-sensor movelet method, we compute the Euclidean distances for the x, y, and z axes of both the accelerometer and the gyroscope and average these six distances together.
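Because $d_2$ is again an average of per-axis Euclidean distances, the $d_1$ sketch from Section 2.3 carries over unchanged except for the number of columns (again our own illustration):

```r
# Discrepancy d2: Euclidean distance per axis, averaged over the six
# axes (accelerometer x, y, z, then interpolated gyroscope x, y, z).
d2 <- function(m, m_star) {
  # m, m_star: 10-by-6 matrices, one multistream movelet each.
  mean(sqrt(colSums((m - m_star)^2)))
}
```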
2.5. Analysis Procedure
We applied the extended version of the movelet method using accelerometer and gyroscope data jointly, and we compared its classification accuracy to the original (i.e., single-sensor) movelet method applied to the accelerometer data only and to the gyroscope data only. Table 1 summarizes the key points of the analysis procedure. In the gyroscope-only analyses, we used the original gyroscope data rather than the interpolated gyroscope data, to mimic how an analysis using only gyroscope data would be performed in practice. Applying the movelet method to the original gyroscope data produced an activity classification at each gyroscope timestamp. Since the accelerometer-only and joint-sensor analyses yielded classifications at the accelerometer timestamps, we then obtained a classification at each accelerometer timestamp by taking the classification of the closest gyroscope timestamp. The R code for this paper is provided on GitHub at https://github.com/KebinYan/Code-for-Paper (accessed on 24 February 2022). The analyses were performed using a MacBook Pro laptop with a dual-core Intel Core i5 processor running at 2.7 GHz and 8 GB of 1967 MHz DDR3 onboard memory.
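The nearest-timestamp mapping can be sketched in R as follows (an illustration under our naming, not the repository code):

```r
# Map gyroscope-timestamp classifications to accelerometer timestamps
# by taking the label of the nearest gyroscope timestamp.
nearest_label <- function(acc_time, gyro_time, gyro_label) {
  idx <- vapply(acc_time,
                function(s) which.min(abs(gyro_time - s)),
                integer(1))
  gyro_label[idx]
}
```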
4. Discussion
This study found that combining accelerometer and gyroscope data can result in more accurate activity recognition. For example, the gyroscope-only method had difficulty in differentiating between the activities of standing and sitting, but combining the accelerometer with gyroscope data largely corrected this error. For the activity of walking, combining the accelerometer and gyroscope data improved the accuracy compared to the accelerometer alone in some cases (e.g., participant 1) and to the gyroscope alone in other cases (e.g., participant 3). Although the single-sensor methods using the accelerometer or gyroscope classified ascending and descending stairs to a certain degree, the combined method using both sensors made further improvements in some cases. Our results also showed that for certain types of movement, a properly chosen single-sensor method may be adequate, e.g., accelerometer for stationary activities. These findings highlight the close connections among the specification of scientific questions (e.g., what activities are of interest?), the choice of data types (whether to collect accelerometer data, gyroscope data, or both), and the choice of the data analysis method.
This work expands the range of usage of the movelet method. Previous work on the movelet method mostly concentrated on using a single body-worn accelerometer, or multiple accelerometers fixed to different parts of the body. We extended the movelet method to incorporate different types of sensors from one smartphone. This takes full advantage of the smartphone as a compact, convenient-to-carry, all-in-one instrument, which can sense different types of movement simultaneously over long time periods.
The movelet method is a useful classification tool that is simple to implement in research and clinical settings. Our analyses showed that the movelet method is fairly accurate. Compared with other statistical methods, this relatively simple approach has the advantage of being transparent, intuitive, and interpretable. Therefore, the movelet method can be used together with more complex methods, such as deep learning, to gain more insight into the classification procedure. Moreover, given that the movelet method makes activity classifications for each person based on her/his own dictionary, the classifications are personalized to the individual's unique data patterns and, therefore, account for factors such as the person's height, weight, age, and health conditions. Models built using training data from one cohort (e.g., young, healthy people) may perform poorly when applied to another group (e.g., older adults or patients with illnesses) [28,29].
As our analysis results show, one remaining problem is that the joint-sensor method and the two single-sensor methods all had difficulty accurately classifying slow walking. To address this issue in future work, we plan to develop an extension of the joint-sensor method that allows for movelet transforms, which stretch or compress the 1-s dictionary movelets [30]. The purpose of movelet transforms is to improve activity recognition when the participant performs a given activity at a different pace during testing than during training. These transforms may help improve the accuracy of slow walking recognition. They may also be adaptable to cases where a patient's condition evolves over time (e.g., a patient's walking pace may increase over time as she/he recovers from surgery). In future work, we also plan to use smartphone sensor data to examine a patient's gait patterns, in addition to performing activity recognition. For example, there is existing work on analyzing human gait using sensor data, including estimating gait parameters (e.g., average stride duration) and detecting gait abnormalities [31,32]. In future studies, we can investigate methods to analyze human gait using smartphone accelerometer and gyroscope data jointly.
One limitation of this work is that we studied the specific case where the phone is worn in the pocket. In reality, the phone can be carried in different locations (e.g., pocket, hand, backpack, purse) that can change with time. The specific context may also differ (e.g., phone in a tighter pocket or oriented in a different direction). An area of future work is to extend the joint-sensor method to accommodate these changes robustly. One approach is to identify the location of the smartphone placement based on the accelerometer and gyroscope data, and then apply the appropriate dictionary accordingly. We may also consider standardizing the training and test data based on the phone's placement to reduce the influence of context on the signal amplitude.
This work was a pilot study based on a small sample collected by the investigators, and the small sample size is a limitation. Our goal in this pilot study was to understand each sensor's role and how the combination of the sensors could provide further information. To achieve this goal, we performed a detailed analysis at the highest possible frequency, verifying the activity classification at each time point and for each activity. We believe these results can apply in more general situations, but this should be confirmed in a study with a larger sample size. We are planning such a data collection.
In our future work, we will further develop the movelet method and apply it in free-living environments. On the one hand, we will thoroughly evaluate the performance measures of the joint-sensor method, including sensitivity, specificity, and precision for each activity type. This is aligned with our current planning for a major effort to collect multi-sensor data with a large sample size of diverse participants. On the other hand, we will improve the movelet methodology and combine it with other advanced statistical tools. First, we will build more sophisticated dictionaries with more categories of activities. Future data collection and analyses can incorporate new activities that are not in the current dictionaries, such as the activity of running. Second, we will automate the customized dictionary generation process for each individual person using machine learning techniques. In addition to these major developments, we will fine-tune and expand the joint-sensor method that we are using now. For example, our current analysis applied linear interpolation to interpolate the gyroscope data to the accelerometer timestamps. An area of future work is to test other interpolation methods, such as cubic splines or B-splines. Moreover, based on our analyses, gyroscope and accelerometer data seemed to play different roles in identifying different types of movement. To take advantage of these differences, we will evaluate whether assigning different weights to the two sensors and their axes (x, y, and z) can improve the accuracy of the joint-sensor method.
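For instance, one weighted variant of the discrepancy metric (our notation; not yet part of the method) could be

$$d_w(M, M^{*}) = \frac{\sum_{j=1}^{6} w_j\, L(\mathbf{u}_j, \mathbf{u}_j^{*})}{\sum_{j=1}^{6} w_j},$$

where $\mathbf{u}_j$ ranges over the six axis vectors (accelerometer and gyroscope x, y, z) and the nonnegative weights $w_j$ could be tuned on training data; setting all $w_j = 1$ recovers $d_2$.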