1. Introduction
The modeling of human mobility is an emerging research area. Studying the regularity and characteristics of human spatiotemporal mobility is of great significance in many fields, such as urban planning [1,2], traffic forecasting [3], and epidemic prevention [4,5].
When modeling human mobility, it is common to consider the probability distribution function (PDF) of its metrics (e.g., travel distance, travel time, and travel speed). It is generally accepted that daily and hourly human mobility metrics have a representative distribution [6]. Modeling the distributions of these metrics is fundamental, necessary, and beneficial for studying underlying travel patterns and establishing a base for human mobility research.
Recently, with the rapid development of information and communication technology (ICT) and location-based service (LBS) applications, online car-hailing services equipped with the Global Positioning System (GPS) have come to play an increasingly important role in people’s daily travel. As an important data source, online car-hailing platforms (e.g., Uber, Lyft, and Didi Chuxing) generate a large amount of accurate location data. Unlike traditional survey data, cell phone datasets, wireless network traces, and taxi location data, online car-hailing data are high-quality, high-resolution, and large-scale, and they record the detailed spatiotemporal trajectory and the actual origin and destination of each trip. Online car-hailing therefore provides a rich and solid data foundation for distribution modeling, bringing new opportunities and challenges for understanding people’s travel behavior and intra-urban mobility.
To our knowledge, previous studies have proposed quite a few patterns of human mobility, such as the Lévy flight model [7,8], power law distribution [9,10,11,12,13], exponential distribution [14,15,16,17,18], lognormal distribution [19,20,21,22], Weibull distribution [23], Pareto distribution [24], and Gamma distribution [22]. Brockmann et al. observed that human travel distance exhibits a power law distribution by analyzing the statistical properties of bank notes’ circulation, and human travel trajectories may be approximated as Lévy flights (heavy-tailed random walks) [7]. This observation was confirmed by Rhee et al. using GPS traces collected from volunteers, showing a non-negligible probability of high-displacement trips and long pause times between trips [8]. Despite the randomness indicated by Lévy flight models, a power law with an exponential cutoff can be used to approximate the displacement distribution of human trajectories obtained from mobile phone datasets [9,11], GPS traces [12], and online location-based social networks [13]. However, Kang et al., Jiang et al., and Liang et al. pointed out that an exponential distribution, rather than a power law, can be used to approximate taxis’ travel displacement and travel time [14,15,16,17,18]. Furthermore, by analyzing a taxi-trace dataset, Wang et al. found that displacement tends to follow an exponential distribution, and travel time is approximated by a lognormal distribution [19].
Although the findings mentioned above provide a useful reference for mining human mobility, they mostly focus on a single model, which may not fit all data well. Zheng et al. found that a fusion function, based on an exponential power law and a truncated Pareto distribution, represents the travel time distribution best [24]. Bazzani et al. studied the GPS data of private cars in Florence, Italy, and found that the single-trip length follows an exponential behavior at short distance scales but favors a power law distribution for trips longer than 30 km [18]. Csáji et al. and Zhang et al. found that the exponential distribution is not appropriate for travel distances, and that the lognormal distribution provides reasonable fits [20,21]. Plötz et al. used Weibull, Gamma, and lognormal distributions to fit individual daily driving distances, and found that Weibull and lognormal most often perform better than Gamma, and that the Weibull distribution fits most data but not all [23]. Kou and Cai analyzed the distributions of travel distance and travel time, and found that both follow a lognormal distribution in larger bike sharing systems, while the distribution for smaller systems varies among Weibull, Gamma, and lognormal [22].
To sum up, many empirical studies based on various datasets have demonstrated that mobility metrics may be fitted with several meaningful distributions, such as Lévy flight models, exponential, power-law, lognormal, Gamma, Weibull, and Rayleigh. However, whether a single or mixed model can achieve a good fit for all data in a real large-scale car-hailing trajectory dataset remains to be explored. In addition, the above studies focused on modeling the distribution of human mobility with simple or overall data, while ignoring the variability of the PDF with day of week and time of day. Is the distribution type of mobility metrics at different time granularities different from that of the overall data? If so, how does it vary, and can it be described by a general distribution? These questions have aroused the interest of many scholars. Research based on a real large-scale car-hailing trajectory dataset is therefore needed to gain insight into human mobility patterns.
To fill this gap, this paper aims to model the distribution of human mobility metrics at different time granularities. Specifically, three metrics (travel distance, travel time, and travel speed) are introduced to explore massive trajectory data collected in Xi’an, China. For each mobility metric, several candidate distributions are compared based on model selection criteria, and the best one is selected. The statistical distributions of daily trip data are analyzed first, and they are found to be skewed. At a finer granularity, hourly distributions are then evaluated, and a general distribution for each mobility metric is determined.
The remainder of this paper is organized as follows. Section 2 briefly introduces the online car-hailing dataset and carries out the basic analysis. Section 3 describes the trip metrics, the fitting distributions, and the method of model selection. Section 4 presents the results and analysis. Section 5 discusses the findings. Finally, Section 6 provides conclusions and recommendations for further research.
2. Data Collection and Basic Analysis
2.1. Data Description
The adopted trajectory data were generated by about 18,000 online car-hailing trips in Xi’an, China, from 1 October 2016 to 30 November 2016. Vehicle trajectories were composed of high-resolution GPS points, which were recorded every 2–4 s. Accordingly, an online car-hailing trajectory is a sequence of GPS sampling points with five fields. The vehicle ID and order ID were desensitized to protect privacy. The “Timestamp” field indicated when the point was recorded, in UTC time. “Latitude” and “Longitude” provided the location of the online car-hailing vehicle.
Let $Tr_{i}^{j}$ denote the trajectory of the $j$-th trip of vehicle $i$, where $p_k$ is the $k$-th point of the sequence ($k = 1, 2, \ldots, n$). Each point is $p_k = (l_k, t_k)$, where $l_k$ and $t_k$ denote the location and the timestamp, respectively. Given a trajectory, $Tr_{i}^{j} = \{p_1, p_2, \ldots, p_n\}$. For a vehicle, the origin and destination (OD) locations are the first and last sampling points of a trip. It makes sense to define $O = p_1$ and $D = p_n$. Hence, each OD trip can be simplified to be a vector from $O$ to $D$.
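As an illustration, the trajectory representation described above can be sketched in Python as follows. This is a minimal sketch rather than the processing code used in this study: the class name GpsPoint, the function origin_destination, and the sample coordinates are hypothetical, and the five fields simply mirror those listed in Section 2.1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GpsPoint:
    """One GPS sampling point with the five fields named in Section 2.1."""
    vehicle_id: str   # desensitized vehicle ID
    order_id: str     # desensitized order (trip) ID
    timestamp: int    # UTC time in seconds
    latitude: float
    longitude: float


def origin_destination(trajectory: List[GpsPoint]) -> Tuple[GpsPoint, GpsPoint]:
    """Return the first and last sampling points of a trip, ordered by time."""
    if not trajectory:
        raise ValueError("empty trajectory")
    ordered = sorted(trajectory, key=lambda p: p.timestamp)
    return ordered[0], ordered[-1]


# Hypothetical three-point trip, deliberately given out of time order.
trip = [
    GpsPoint("v1", "o1", 1475280006, 34.2610, 108.9420),
    GpsPoint("v1", "o1", 1475280000, 34.2600, 108.9400),
    GpsPoint("v1", "o1", 1475280003, 34.2605, 108.9410),
]
origin, destination = origin_destination(trip)
```

Sorting by timestamp before taking the end points guards against records that arrive out of order, which is an assumption about the raw data rather than a property reported here.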
A road network consists of a set of nodes, directed links, and allowed movements. Each node is a geographical location representing a network intersection, which can be either signalized or non-signalized. A link is defined to be the road section from its tail node to its head node. The relative position denotes the position of a sampling point along a link as a ratio measured from the link start node, and ranges over [0, 1]. For example, relative positions of 0, 0.5, and 1 represent the beginning, middle, and end of a link.
2.2. Data Processing
To model the travel distance distribution, the map matching (MM) and path inference algorithm proposed by Chen et al. were first used [25]. As shown in Table 1, the original latitude and longitude were converted into geodetic coordinates, which could be used directly to calculate the travel displacement. Secondly, the relative position of the sampling point on the link was calculated, and the UTC time was converted to the time of day (0–86,400 s). Thirdly, the pick-up and drop-off points were extracted to calculate the trip metrics (e.g., travel time, travel displacement, and travel distance).
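The two conversions in this step can be illustrated with a short Python sketch. Two assumptions are made that the text does not state: the time of day is taken in local time with a fixed UTC+8 offset, and the displacement is computed with the haversine great-circle formula directly from latitude and longitude, as a stand-in for the geodetic-coordinate calculation used in this study. The function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def time_of_day_seconds(utc_timestamp: int, utc_offset_hours: int = 8) -> int:
    """Map a UTC timestamp to seconds since local midnight (0-86399).

    The UTC+8 default is an assumption made for this sketch.
    """
    return (utc_timestamp + utc_offset_hours * 3600) % SECONDS_PER_DAY


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# 1475280000 is 2016-10-01 00:00:00 UTC, i.e. 08:00:00 at UTC+8.
pickup_tod = time_of_day_seconds(1475280000)
# Hypothetical pick-up and drop-off coordinates.
displacement_km = haversine_km(34.2600, 108.9400, 34.2900, 108.9800)
```

Note that the displacement is the straight-line separation of the pick-up and drop-off points, whereas the travel distance follows the map-matched path and is therefore at least as long.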
Finally, data cleaning was essential, because some of the trip records were not suitable for use in this study. Considering travel costs, few passengers travel by online car-hailing when the travel time or distance is very short or very long [26,27]. In addition, travel speed should be within a reasonable range. Therefore, trip records meeting any of the following conditions were excluded from the study data: (1) travel distance and displacement between origin and destination of less than 300 m; (2) travel time of less than 1 min or longer than 2 h; (3) average travel speed below 5 km/h or in excess of 80 km/h [28].
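The three exclusion rules can be expressed as a simple filter. The sketch below is illustrative only: the Trip class and its field names are hypothetical, the average speed is assumed to be travel distance divided by travel time, and rule (1) is interpreted as excluding a trip when either the distance or the displacement is below 300 m.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trip:
    distance_km: float      # path length after map matching
    displacement_km: float  # straight-line origin-destination separation
    duration_s: float       # travel time in seconds

    @property
    def speed_kmh(self) -> float:
        # Assumed definition: average speed = travel distance / travel time.
        return self.distance_km / (self.duration_s / 3600.0)


def is_valid(trip: Trip) -> bool:
    """Apply the exclusion rules (1)-(3); True means the trip is kept."""
    # Rule (2): travel time between 1 min and 2 h (also avoids division by zero).
    if trip.duration_s < 60 or trip.duration_s > 7200:
        return False
    # Rule (1): distance and displacement of at least 300 m (assumed: either one failing excludes).
    if trip.distance_km < 0.3 or trip.displacement_km < 0.3:
        return False
    # Rule (3): average speed between 5 and 80 km/h.
    return 5.0 <= trip.speed_kmh <= 80.0


# Hypothetical records: only the first one satisfies all three rules.
trips: List[Trip] = [
    Trip(5.0, 3.5, 900),     # 20 km/h, kept
    Trip(0.2, 0.1, 120),     # too short in distance
    Trip(1.0, 0.8, 30),      # shorter than 1 min
    Trip(50.0, 40.0, 1800),  # 100 km/h, too fast
    Trip(0.5, 0.4, 3600),    # 0.5 km/h, too slow
]
kept = [t for t in trips if is_valid(t)]
```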
In terms of the total trips over the two months, 6,203,848 trips were retained from 6,584,397 original trips after data cleaning, which means that about 6% of the trips were filtered out. From the perspective of daily trips, the average order availability was 94.22%, fluctuating between 93.21% and 94.85%. At a finer granularity, the study period was discretized into 1464 (24 × 61) 1 h intervals for further analysis of residents’ hourly trips. The hourly trip quantity ranged from 192 to 8636, as shown in Figure 1. On the whole, the number of trips during the day was much higher than at night, which is consistent with people being more active during the day. In addition, although the number of hourly trips between 00:00 and 07:00 may be less than 2000, it was sufficient for distribution fitting.
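The discretization into 1 h intervals, and the retained share quoted above, can be reproduced with a few lines of Python. The sketch assumes that each trip is assigned to an interval by its pick-up time and that intervals are aligned to local midnight at UTC+8; neither detail is stated in the text, and the function names and sample timestamps are hypothetical.

```python
from collections import Counter
from typing import Tuple

STUDY_START_UTC = 1475251200  # 2016-10-01 00:00:00 at UTC+8 (assumed local time)
N_DAYS = 61                   # 1 October to 30 November 2016
N_INTERVALS = 24 * N_DAYS     # 1464 one-hour intervals


def hour_index(pickup_utc: int) -> int:
    """Index (0..1463) of the 1 h interval containing a pick-up timestamp."""
    idx = (pickup_utc - STUDY_START_UTC) // 3600
    if not 0 <= idx < N_INTERVALS:
        raise ValueError("timestamp outside the study period")
    return idx


def day_and_hour(idx: int) -> Tuple[int, int]:
    """Split an interval index into (day number from 0, hour of day)."""
    return divmod(idx, 24)


# Hypothetical pick-up times: two in the first hour, one in the second.
pickups = [STUDY_START_UTC + 10, STUDY_START_UTC + 3599, STUDY_START_UTC + 3600]
hourly_counts = Counter(hour_index(t) for t in pickups)

# Share of trips retained after cleaning, from the totals reported above.
retained_share = 100.0 * 6_203_848 / 6_584_397
```

The retained share evaluates to about 94.22%, which matches the reported average order availability and corresponds to roughly 5.8% of trips being filtered out.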
4. Results of the Best-Fit Distribution
This section reports the fitting results of the trip metrics using the candidate skewed distributions. The best-fit distributions of daily trip metrics are first shown. Then, we further analyze the best-fit results of hourly trip metrics, such as travel distance, travel time, and travel speed. Finally, we attempt to identify a general distribution for each trip metric, and to estimate its parameters.
4.1. Best-Fit Distribution of Daily Trip Metrics
Figure 4 shows the best-fit distributions of daily trip metrics for 61 days. Overall, only two of the candidate distributions, the Gamma and Burr distributions (represented by blue and red), are suitable for fitting these trip metrics. The Gamma distribution performs best for travel distance, and can uniformly fit all daily data, as shown in blue in Figure 4. However, the Gamma distribution does not fit all travel time or speed data well. In total, 47.54% (29 out of 61) of travel time data and 63.93% (39 out of 61) of travel speed data follow the Burr distribution (depicted in red).
The fitting results for weekdays and weekends in Figure 4 are distinguished by two different markers (dot and star). For travel distance, weekday and weekend data can be well fitted with the same distribution. However, a third of the weekend travel time data follow the Gamma distribution and the rest follow the Burr distribution, while the proportions are reversed for the weekend speed data. Meanwhile, only 37.21% (16 out of 43) of travel speed data on weekdays obey the Gamma distribution, which further decreases to 13.64% (3 out of 22) on November weekdays.
The above analysis also shows that, due to uniform and simple fitting distribution, travel distance is more straightforward for analyzing residents’ mobility patterns. In comparison, travel time and speed are relatively complex metrics due to uncertain distribution types.
In addition, the mean BIC weights of travel distance, travel time, and travel speed are 1, 0.9996, and 0.9960, respectively, which indicate low uncertainty in the fitting results. The smallest BIC weight occurs in the travel speed data on October 13, and the fitting distributions and the detailed parameters are shown in Figure 5. The figure shows that the Burr distribution is the most consistent with the speed data, followed by the Gamma distribution. It is noteworthy that the Burr distribution is more complex than the Gamma distribution. A more complex model is considerably more likely to achieve a better fit than a simpler one, so model fit and complexity should be considered together.
Table 2 shows more parameters in the model selection. The observed mean and standard deviation (STD) are very close to those estimated by several candidate distributions. The commonly used evaluation indices, such as the mean absolute percent error (MAPE) and the root mean square error (RMSE), also struggle to identify the dominant distribution. Furthermore, the log-likelihood of the Burr distribution is slightly larger than that of the Gamma distribution, which quantitatively shows that the Burr distribution has a better model fit. When the BIC takes model complexity into account, the gap narrows to 2, which means that the benefit of improved model fit still outweighs the cost of added model complexity. This small advantage is clearly distinguished in the BIC weight, which makes the model selection more explicit. It can be determined that the best-fit model is the Burr distribution.
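To make the role of the BIC weight concrete, the following Python sketch computes BIC values and weights under the standard definitions, BIC = k ln n - 2 ln L and w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i is the BIC difference from the best candidate. These forms are assumed here to match the model selection method of Section 3. The BIC values in the example are hypothetical and are not taken from Table 2.

```python
import math
from typing import Dict


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def bic_weights(bics: Dict[str, float]) -> Dict[str, float]:
    """Convert BIC values into weights that sum to 1 (lower BIC, higher weight)."""
    best = min(bics.values())
    relative = {name: math.exp(-0.5 * (value - best)) for name, value in bics.items()}
    total = sum(relative.values())
    return {name: r / total for name, r in relative.items()}


# Hypothetical BIC values with a gap of 2 between two candidates.
weights = bic_weights({"Burr": 1000.0, "Gamma": 1002.0})
# With two candidates and a BIC gap of 2, the better model's weight is
# 1 / (1 + e**-1), about 0.731.
```

Because the weights are normalized over all candidates, adding further candidates with comparable BIC values would lower the weight of the best model.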
4.2. Best-Fit Distribution of Hourly Travel Distance
In order to further investigate the hourly distributions of trip metrics, the study period is discretized into 1464 (24 × 61) 1 h intervals. Figure 6 shows the best-fit distributions of hourly travel distance. The best-fit distributions are distinguished by five different markers, of which the Gamma distribution accounts for 93.10%, far higher than the other four distributions. Between 07:00 and 24:00, the proportion of the Gamma distribution rises to 99.32%, which further demonstrates the advantage of the Gamma distribution in fitting travel distance data. During 00:00–07:00, the optimal distribution varies with hour and day, and is scattered across the five markers. Only 77.99% of the data still obey the Gamma distribution, while 14.52% follow the Weibull distribution, 6.79% the Rayleigh distribution, and less than 1% the lognormal distribution. This may be due to the small sample size at night.
In addition, the BIC weights represented in different colors range from 0.38 to 1. The darker the color, the smaller the BIC weight. It can be seen from Table 3 that most of the BIC weights between 07:00 and 24:00 are very close to 1. Their mean BIC weight is 0.9986, indicating high reliability in the model selection. However, the mean BIC weight from 00:00 to 07:00 is 0.8540, which indicates some uncertainty in the fitting results. More specifically, the mean BIC weights of the lognormal, Weibull, and Rayleigh distributions are 0.8362, 0.7094, and 0.7210, respectively. The high uncertainty may be caused by the small sample size, because fewer people travel at night. Meanwhile, lower weights also mean that the best and second-best fitting distributions may both be applicable to the data. In conclusion, the Gamma distribution may also be applicable to all the travel distance data.
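The comparison of candidate distributions for a single 1 h interval can be sketched as follows. This is a simplified illustration under several assumptions: only two of the five candidates (Gamma and Rayleigh) are scored, the Gamma parameters are estimated by the method of moments as a simpler stand-in for maximum likelihood estimation, and the travel distances are synthetic values drawn from a Gamma distribution rather than observed data.

```python
import math
import random
import statistics


def gamma_fit_loglik(xs):
    """Method-of-moments Gamma fit; returns (shape, scale, log-likelihood)."""
    mean = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu=mean)
    shape, scale = mean * mean / var, var / mean
    loglik = sum((shape - 1.0) * math.log(x) - x / scale for x in xs)
    loglik -= len(xs) * (shape * math.log(scale) + math.lgamma(shape))
    return shape, scale, loglik


def rayleigh_fit_loglik(xs):
    """Closed-form Rayleigh maximum likelihood fit; returns (sigma, log-likelihood)."""
    sigma2 = sum(x * x for x in xs) / (2.0 * len(xs))
    loglik = sum(math.log(x) - x * x / (2.0 * sigma2) for x in xs)
    loglik -= len(xs) * math.log(sigma2)
    return math.sqrt(sigma2), loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Synthetic hourly travel distances (km), drawn from Gamma(shape=2, scale=3).
rng = random.Random(0)
distances_km = [rng.gammavariate(2.0, 3.0) for _ in range(2000)]
n = len(distances_km)

shape, scale, ll_gamma = gamma_fit_loglik(distances_km)
sigma, ll_rayleigh = rayleigh_fit_loglik(distances_km)
scores = {"Gamma": bic(ll_gamma, 2, n), "Rayleigh": bic(ll_rayleigh, 1, n)}
best = min(scores, key=scores.get)  # candidate with the lowest BIC
```

With far fewer observations, as in night-time intervals, the BIC gap between candidates shrinks, which is one way the lower BIC weights reported for 00:00 to 07:00 could arise.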
4.3. Best-Fit Distribution of Hourly Travel Time
Figure 7 shows the best-fit distributions of hourly travel time, which are distinguished by different markers. As shown by the black circles and red squares, the Gamma and Burr distributions are the dominant distribution types, with proportions of 76.23% and 20.56%, respectively. In Table 4, the larger BIC weights (0.9430 and 0.8982) of these two distributions show a significant advantage over the other three distributions when fitting 96.79% of the data. Meanwhile, the fitting results have higher reliability.
On the other hand, the other three distributions, shown by blue dots, green triangles, and red stars in Figure 7, account for less than 4%, and appear mainly at night (02:00–07:00). At the same time, their average BIC weights are only 0.7515, 0.6296, and 0.7853, respectively. These lower BIC weights indicate higher uncertainty for the lognormal, Weibull, and Rayleigh distributions in fitting the data, so these distributions are likely to be replaceable by the Gamma or Burr distribution.
During the National Day holiday, 82.74% of travel time data follow Gamma distribution with a mean BIC weight of 0.9541. However, only 13.69% obey the Burr distribution, with an average BIC weight of 0.9352. This means that travel time data for the National Day holiday are more inclined to the Gamma distribution than the Burr distribution.
4.4. Best-Fit Distribution of Hourly Travel Speed
Figure 8 shows the fitting results of hourly travel speed. The best-fit distributions are distinguished by rhombus, circle, square, and star markers. On the whole, the Burr distribution accounts for 85.11% (1246 out of 1464) of the best-fit distributions, with a mean BIC weight of 0.9830, showing an absolute advantage and high reliability. Similarly, the lognormal and Gamma distributions have higher BIC weights but lower ratios. In contrast, the fitting results for only six 1 h intervals (0.41%) are consistent with the Weibull distribution, with an average BIC weight of 0.7007. This means that the Weibull distribution is not suitable for speed data.
In addition, some clustering features can be found in the non-dominant best-fit distributions. For example, 77.94% (53 out of 68) of the lognormal distributions occur during the daily evening peak, with a mean BIC weight of 0.9986. About 40% of the Gamma distributions occur during the National Day holiday, with a mean BIC weight of 0.9246. In conclusion, the Burr distribution is dominant in fitting travel speed data.
4.5. General Distribution Selection
Based on the above analysis, the Gamma distribution is dominant in the fitting of travel distance and travel time, while the Burr distribution is more suitable for travel speed. Trips following other distributions make up only a very small portion of the total trips. This raises the question of whether all the data for each metric can be fitted with its dominant distribution and, if so, how much worse the fit will be. To answer this question, the K–S test, the BIC difference, the mean absolute error (MAE), and the mean absolute percentage error (MAPE) are analyzed separately.
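For reference, the one-sample K–S statistic used in such a check is the largest gap between the empirical distribution function of the sample and the candidate cumulative distribution function (CDF). The Python sketch below is illustrative: it uses the Rayleigh CDF because it has a closed form in the standard library, a synthetic sample with a known parameter, and the common large-sample approximation 1.36/√n for the critical value at the 5% level. The significance level and the handling of estimated parameters in this study are not assumed to match.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Largest gap between the empirical distribution function and a CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def rayleigh_cdf(sigma):
    """Return the CDF of a Rayleigh distribution with scale sigma."""
    def cdf(x):
        return 1.0 - math.exp(-x * x / (2.0 * sigma * sigma)) if x > 0 else 0.0
    return cdf


# Synthetic sample: a Rayleigh variate is a Weibull variate with shape 2
# and scale sigma * sqrt(2).
rng = random.Random(1)
sigma = 4.0
sample = [rng.weibullvariate(sigma * math.sqrt(2.0), 2.0) for _ in range(500)]

d = ks_statistic(sample, rayleigh_cdf(sigma))
critical = 1.36 / math.sqrt(len(sample))  # approximate 5% critical value
accepted = d <= critical
```

When the distribution parameters are estimated from the same sample that is being tested, this critical value is only approximate, so borderline results should be read with caution.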
As shown in Table 5, among the trip metrics, 101, 348, and 218 out of 1464 1 h intervals are, respectively, replaced by the Gamma or Burr distribution. For travel distance, the K–S test considers the alternative distribution (Gamma) and the best-fit distribution acceptable for about 90% (93 and 90 out of 101, respectively) of the data. Meanwhile, the alternative distribution performs better for travel time because more data (76%) pass the K–S test. However, for the travel speed metric, the opposite is true, which needs to be further explained by other indicators. Moreover, selection of the more complex model indicates that the benefit of improved model fit outweighs the cost of added model complexity.
For the first two trip metrics, the BIC difference between the alternative distribution and the best-fit distribution is relatively small, both being less than 10. However, the BIC difference for travel speed is slightly larger, possibly due to the different magnitude of the BIC values between the metrics. The ratio of the BIC difference to the BIC of the best-fit distribution is less than 0.5%, which also indicates that there is no significant difference between the two distributions in BIC-based model selection. In addition, the MAPE and MAE between the fitted distribution and the sample distribution at the 10th, 50th, and 90th percentiles are calculated by comprehensively considering the mean and variance. A MAPE of less than 4% further demonstrates the feasibility of fitting all data with a dominant distribution.
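The percentile-based error measures can be illustrated with a short Python sketch. It assumes that the MAE and MAPE are averaged over the three percentiles, comparing the sample value with the fitted value at each one. The function names are hypothetical, and the observed and fitted values are made-up numbers rather than entries from Table 5.

```python
import statistics


def key_percentiles(values):
    """10th, 50th, and 90th percentiles of a sample (linear interpolation)."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return [deciles[0], deciles[4], deciles[8]]


def mae_mape(observed, fitted):
    """Mean absolute error and mean absolute percentage error (in %)."""
    errors = [abs(f - o) for o, f in zip(observed, fitted)]
    mae = sum(errors) / len(errors)
    mape = 100.0 * sum(e / abs(o) for e, o in zip(errors, observed)) / len(errors)
    return mae, mape


# Hypothetical travel distances (km) at the 10th, 50th, and 90th percentiles.
observed_q = [2.0, 5.0, 10.0]  # from the sample distribution
fitted_q = [2.1, 4.9, 10.4]    # from the fitted distribution
mae, mape = mae_mape(observed_q, fitted_q)
```

In this made-up example the MAE is 0.2 km and the MAPE is about 3.67%, so the fit would fall under a 4% MAPE threshold.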
5. Discussion
Based on the analysis in the previous section, the direct statistics (i.e., travel distance and travel time) of hourly trips all follow the Gamma distribution, while the indirect statistic (i.e., travel speed) obeys the Burr distribution. Table 6 lists the average indicators of the trip metrics, which reflect the performance of the Gamma and Burr distributions. The larger the R² is, the better the goodness-of-fit. Likewise, smaller MAE, MAPE, and PPC values indicate a better goodness-of-fit. Three key percentiles, the 10th, 50th, and 90th, were adopted to calculate the MAE and MAPE. A confidence level of 80% was adopted to construct the confidence interval.
According to Table 6, the mean R² values of the three trip metrics exceed 0.98, and even reach 0.9915 for travel time. This shows that the fitted distributions explain the data well. The MAPE and MAE indicators further support this. The MAE is less than 0.1 km for travel distance, lower than 0.2 min for travel time, and about 0.21 km/h for travel speed. MAPEs below 3% further illustrate that the fitted distributions are quite consistent with the observed data.
The mean POPIs are slightly worse than the target (20%), which indicates that less than 80% of the observed data are covered by the fitted confidence interval. A higher POPI means a lower POOI. As mentioned earlier, the fitting distributions work best when both POPI and POOI are close to the target value (20%). The PPC values of 4.31%, 2.54%, and 3.11%, respectively, show a low degree of deviation of POPI and POOI, which also indicates that the fitting distributions of the three trip metrics have good accuracy. In summary, these results indicate that the Gamma distribution fits the direct trip metrics (travel distance and travel time) well, while the Burr distribution fits travel speed better.
6. Conclusions
This study models the distributions of human mobility metrics based on actual trajectory datasets, including about 18,000 online car-hailing rides, collected in Xi’an, China. Three trip metrics (travel distance, travel time, and travel speed) are highlighted in order to establish a base for human mobility research. The results of this study provide several new insights into human mobility.
First, based on online car-hailing trajectory data, the mobility metrics tend to follow right-skewed distributions rather than the normal distribution. By analyzing the daily and hourly trip data, five of the most widespread right-skewed distributions in the scientific literature (i.e., lognormal, Gamma, Weibull, Burr, and Rayleigh) were selected as candidate distributions. By leveraging the Bayesian information criterion (BIC), we comprehensively analyzed the goodness of fit and complexity of the candidate distributions for each metric, thus acquiring the best-fitting distribution and suitable parameters. The empirical results based on the online car-hailing trajectory dataset in Xi’an, China, have provided strong evidence that the mobility metrics obey right-skewed distributions.
Second, the distribution types of mobility metrics vary with day of week and time of day, which means that a single distribution cannot be the best fit for all the daily and hourly data. For the daily data, the Gamma distribution performs best among all alternative distributions for travel distance and can uniformly fit all daily data, whereas the Gamma or Burr distribution can only achieve a good fit for part of the daily travel time or speed data. For the hourly data, the best-fit distributions vary among alternative distributions, especially at night. The Gamma distribution most often performs better than the other four distributions for both travel distance and travel time, while the Burr distribution performs best for travel speed.
Third, although uncertain distribution types exist in the daily and hourly data, a dominant distribution exists for each mobility metric. For example, the Gamma distribution can fit more than 90% of hourly travel distance data, and the Burr distribution can achieve a fit for 85% of hourly travel speed data. Further analysis shows that it is feasible to fit all hourly data of each metric with its dominant distribution.
It is expected that the findings from this study can promote understanding of intra-urban human mobility and lay a solid foundation for human mobility research. However, we do note several limitations of this research. Firstly, the candidate distributions are limited to five commonly used skewed distributions, and more may need to be considered. Secondly, only the distributions of daily and hourly trip data are fitted and analyzed; finer-grained distributions (e.g., 30 min or 15 min) are also of interest and need further study. Last but not least, distributions may vary across datasets, so multi-source data need to be taken into account to confirm the conclusions.