1. Introduction
Air pollution, which includes emissions from vehicles, industrial processes, and other sources, contributes to environmental degradation. Pollutants can harm ecosystems, damage vegetation, and affect water quality, thus undermining the sustainability of natural resources. In addition, poor air quality can have severe health consequences, leading to respiratory diseases, cardiovascular problems, and even premature death. Therefore, quantifying the health impact of air pollution plays an important role in sustainable urban development.
Exposure refers to the dynamic interaction between air pollutants and the surface of the human body, delineating the interplay between the environment and the human body. Assessing the level of pollutant exposure involves evaluating both the duration of contact and the concentration of the associated pollutants [1]. As one of the environmental problems derived from industrialization, the air pollutant PM2.5 has a serious impact on the health of residents. Both long-term [2] and short-term [3] exposure to PM2.5 have harmful effects on human health, in particular increasing the risk of cardiovascular and respiratory diseases as well as lung cancer. Since no established research indicates that PM2.5 concentrations below a certain threshold are entirely harmless to humans, it is important to minimize exposure levels as much as possible. Because PM2.5 has significant detrimental effects on human health, it is crucial to quantify exposure to PM2.5 throughout the day. The estimation of exposure can stimulate more discussion about public health concerns [4], provide more precise health guidance for individuals, and offer a scientific basis for comprehensive health management of residents.
Until recently, the quantification of human exposure to pollutants was constrained by the limited availability of extensive data and computational resources. Early studies predominantly focused on aggregate-level exposure assessments, such as at the community or neighborhood level [5,6,7,8]. This approach, however, is susceptible to the Modifiable Areal Unit Problem (MAUP) [9], where the outcomes are influenced by the geographical units or spatial scales employed in the studies. Such aggregated analyses may offer only a partial representation of actual exposure scenarios, potentially leading to imprecise conclusions. Another limitation of aggregate-level studies is the omission of individuals’ mobility behavior. By basing exposure assessments predominantly on pollutant concentrations at individuals’ residences and overlooking their mobility, these studies give rise to the Neighborhood Effect Averaging Problem (NEAP) [10]. Park et al. [11] validated this problem, emphasizing the necessity of incorporating spatiotemporal variations in both human mobility and pollutant concentrations to enhance the accuracy of exposure assessments.
In urban built environments, particularly where traffic congestion and long-distance commutes are prevalent [12], individuals may spend substantial time away from their residences, making the consideration of mobility patterns in exposure assessments indispensable. Therefore, it is necessary to assess exposure at the individual level rather than the aggregate level. Traditional methodologies for individual-level exposure quantification have relied on questionnaire surveys to gather trajectory information [13,14]. However, these methods are labor-intensive, costly, and impractical for large-scale population studies.
The advent of big trajectory data, such as Call Detail Records (CDR), has provided new opportunities for modeling human mobility [15,16,17]. Notable studies have leveraged mobile phone data to elucidate disparities in PM2.5 exposure. For instance, Xu et al. [18] utilized CDR data to investigate environmental justice aspects of PM2.5 exposure in Beijing, revealing economic disparities in exposure levels. Similarly, Guo et al. [19] examined exposure disparities across multiple temporal scales, although the short duration of the available mobile phone data restricted their ability to assess long-term stable exposure patterns of residents. It is also worth noting that research on disparities in ambient air pollution exposure mainly focuses on developed countries [20,21,22,23,24,25], while developing countries, despite suffering from more severe air pollution, have seen relatively limited research on such exposure inequalities [26,27,28]. To the best of our knowledge, no paper has yet studied residents’ individual exposure to PM2.5 in Shanghai, one of the most iconic cities in China.
In this paper, we propose a big data analytics framework to accurately quantify individual PM exposure in Shanghai by coupling mobile phone data with PM concentration data at a fine scale. The mobile phone data were generated by the interaction between mobile phones and communication base stations in daily life from January to April 2014. By performing stay point detection on the mobile phone data, we identify each user’s place of residence and work, as well as the complete daily trajectory. Moreover, to infer fine-grained PM concentration, we combine two types of data: station monitoring data with high temporal resolution and China High Air Pollutants (CHAP) data with high spatial resolution. We then compute each individual’s exposure to PM from their stay behavior and the PM concentration of the corresponding environment. By comparing the results with residence-based exposure, we demonstrate the importance of mobility in measuring exposure. In addition, we analyze the spatial and temporal variations in individual exposure, providing new perspectives and data support for policy formulation.
This paper is organized as follows: Section 2 describes the data utilized in this study and the methodology we propose, and Section 3 presents the obtained results. In Section 4, we further discuss the obtained results, and Section 5 presents our conclusions and future work.
2. Materials and Methods
2.1. Data Description
In this section, we introduce the datasets used in this study, including Call Detail Records (CDR) data and PM concentration recordings.
When a mobile phone user performs operations such as turning the phone on or off, making a call, sending a text message, or using the mobile data network, the phone exchanges information with a base station on a regular or occasional basis to ensure the quality of communication and to support billing. In this process, the communication operator records, in real time, the timestamp of each interaction, the code or location of the base station involved, and other information. In addition to the data generated by the user’s active operations, the base station also periodically detects the signal of the mobile phone and interacts with it at a fixed interval.
Considering that the mobile phone always interacts with the nearest base station, the CDR data can reflect the user’s location. In this paper, we use the Call Detail Record data of the Shanghai area provided by the communication operator, which contains the records of one million anonymous users exchanging information with base stations from 1 January 2014 to 30 April 2014. Each record contains the user’s anonymous ID, the timestamp of the interaction with the base station, and the latitude and longitude of the base station, as shown in Table 1. These one million users generated nearly 60 billion records.
To protect users’ privacy, the CDR data we use are not the most recent. This CDR dataset is used only to support this research and has not been made public. Considering that land use in Shanghai is relatively stable, the spatial distribution of the population and daily mobility behavior in Shanghai are correspondingly stable. Therefore, even though the data were collected in 2014, we can still obtain reliable information about users’ daily mobility behavior for our research.
The calculation of individual PM exposure requires PM concentration data with high spatial and temporal resolution. In this paper, we fuse two PM datasets to generate hourly PM concentration data for each 1-km grid.
First, we collected the air quality data from the national fixed monitoring stations provided by China National Environmental Monitoring Station. Specifically, we selected the real-time PM2.5 concentration data monitored hourly by fourteen fixed monitoring stations in Shanghai. The spatial distribution and the number of monitoring stations are shown in Figure 1.
Shanghai’s environmental monitoring stations were only able to provide hourly PM2.5 monitoring data after May 2014. We compared the daily average PM2.5 concentration of nine monitoring stations in Shanghai over four months in 2014 and 2015. The results in Table 2 show only a slight difference in the average PM2.5 concentration between 2014 and 2015 in Shanghai. Therefore, in this paper, we used the PM2.5 data from January to April 2015 provided by the environmental monitoring stations in Shanghai to extract the daily pattern of PM2.5.
The other dataset is China High Air Pollutants (CHAP), a high-spatial-resolution, high-quality near-surface PM2.5 dataset for China reconstructed by Jing et al. [29,30]. They constructed a Space-Time Extra-Trees (STET) model that fuses aerosol optical depth (AOD) data, meteorological data, land surface conditions, and population distribution to estimate the concentration of PM2.5. This dataset provides the daily average PM2.5 surface concentration on a 1-km grid over China from 2000 to 2021. The cross-validation determination coefficient and root mean square error of the model used to estimate PM2.5
in this dataset were
and
respectively. In this paper, we select the daily PM2.5 concentration data for Shanghai from January 2014 to April 2014 on the 1-km grid to calculate the individual PM2.5 exposure.
For the daily PM2.5 data in the Shanghai area provided by the CHAP dataset, we first performed a visual analysis, as shown in Figure 2. From the temporal perspective, pollution was most severe in the winter month of January 2014, with a monthly average concentration of 62 μg/m³, while the average concentrations in February, March, and April were 42 μg/m³, 44 μg/m³, and 38 μg/m³, respectively. From the spatial perspective, the PM2.5 concentration in Shanghai was higher in the west and lower in the east.
2.2. High-Resolution PM Data
As introduced above, we used two PM2.5 datasets in our research. The PM2.5 data provided by the fixed monitoring stations have high temporal resolution and low spatial resolution, while the CHAP dataset has low temporal resolution and high spatial resolution. We therefore combine the two datasets to derive a new PM2.5 dataset that provides hourly PM2.5 concentrations for every 1-km grid in Shanghai. Specifically, we use the hourly PM2.5 concentrations from the fixed monitoring stations to correct the daily average PM2.5 concentrations of the 1-km grids provided by CHAP and thereby infer the hourly PM2.5 concentration of each grid. To obtain hourly average concentrations from the corresponding daily averages, we define a parameter named the correction factor. For each fixed monitoring station, we estimate its average PM2.5 concentration at each hour of the day in month $m$ from the station data and then calculate its average PM2.5 concentration over the whole month. The ratio of these two values is the hourly correction factor of the station for that month. Specifically, for monitoring station $f$, we use $C_{f}^{m,d,h}$ to represent its PM2.5 concentration at hour $h$ of day $d$ in month $m$, and the correction factor $\alpha_{f}^{m,h}$ of the station at hour $h$ of month $m$ is calculated as follows:
$$\alpha_{f}^{m,h} = \frac{\frac{1}{|M|}\sum_{d \in M} C_{f}^{m,d,h}}{\frac{1}{24\,|M|}\sum_{d \in M}\sum_{h'=1}^{24} C_{f}^{m,d,h'}}$$
where $M$ represents the set of days in the $m$-th month. Since we have 14 fixed monitoring stations, four months, and 24 hours per day, we can calculate a total of $14 \times 4 \times 24 = 1344$ correction factors. Section 3.1 provides an example of calculated correction factors for monitoring stations.
Subsequently, we map the daily average PM2.5 concentration data of the 1-km grids in the CHAP dataset to the grids in Shanghai and delete the grids without mapped values. For grids with multiple mapped values, we take their average. We regard $\bar{C}_{g}^{m,d}$ as the daily average PM2.5 concentration of grid $g$ on the $d$-th day of the $m$-th month. Then, for every grid, we use the correction factor of the nearest fixed monitoring station, denoted $f(g)$, to correct its daily concentration and obtain the hourly average PM2.5 concentration in this grid:
$$C_{g}^{m,d,h} = \alpha_{f(g)}^{m,h} \cdot \bar{C}_{g}^{m,d}$$
We thus obtain the PM concentration for all 24 h of each day from January to April 2014 in Shanghai. This high-resolution PM data (hourly, for each 1-km grid) is essential for calculating individual PM exposure.
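The fusion step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the array layouts, and the use of planar coordinates for the nearest-station search are assumptions made only for illustration.

```python
import numpy as np

def correction_factors(station_hourly):
    """Hourly correction factors for one station and one month.

    station_hourly: array of shape (n_days, 24) with hourly concentrations.
    Returns 24 values: the mean concentration at each hour of the day
    divided by the mean concentration over the whole month.
    """
    station_hourly = np.asarray(station_hourly, dtype=float)
    hourly_mean = np.nanmean(station_hourly, axis=0)  # average over days, per hour
    monthly_mean = np.nanmean(station_hourly)         # average over all days and hours
    return hourly_mean / monthly_mean

def nearest_station(grid_xy, station_xy):
    """Index of the nearest station for each grid centre.

    Assumes planar coordinates (e.g. metres in a projected system).
    """
    grid_xy = np.asarray(grid_xy, dtype=float)
    station_xy = np.asarray(station_xy, dtype=float)
    d2 = ((grid_xy[:, None, :] - station_xy[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def hourly_grid_concentration(daily_grid, factors, nearest):
    """Scale each grid's daily mean by the factors of its nearest station.

    daily_grid: (n_grids,) daily means for one day.
    factors:    (n_stations, 24) correction factors for that month.
    nearest:    (n_grids,) station index per grid.
    Returns an (n_grids, 24) array of hourly estimates.
    """
    daily_grid = np.asarray(daily_grid, dtype=float)
    return daily_grid[:, None] * np.asarray(factors, dtype=float)[nearest]
```

With complete station data the 24 factors of a station average to one, so the hourly estimates of a grid average back to its daily mean.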
2.3. Recognizing Individual’s Stay Locations
As introduced above, CDR data reveal users’ mobility behavior. To infer the mobility traces of residents, we must know the specific locations where each user stayed. There are two types of stay behavior [31]. In the first, the user’s coordinates remain exactly the same for a period of time; this is unusual because, even at the same location, the user’s mobile phone usually produces slightly different records. The second type is more common: the individual moves or stays within a certain range of the same location, but the presence of different base stations in the vicinity leads to subtle differences in the recorded locations. Therefore, it is unreliable to determine a user’s historical stay locations based only on changes in their coordinates.
In this paper, we use the clustering method proposed by Jiang et al. [16] to recognize users’ stay behavior from their CDR data. As Figure 3 shows, we cluster the CDRs in the temporal dimension and the spatial dimension in turn. In this way, we filter out the jitter of the user’s coordinates among base stations and delete the outliers, so that different records near the same location are clustered into a single point.
First, we apply clustering in the temporal dimension to filter out location jitter. We cluster points that are temporally and spatially close in the record sequence into a single location and take the time difference between the first and last records in the cluster as the dwelling time at this point. For example, assume that user $i$ has a CDR sequence $D_i = (d_i(1), d_i(2), \ldots, d_i(n))$, where $d_i(k) = (t(k), x(k), y(k))$ is a 3-tuple recording the timestamp and coordinates of the $k$-th record. By setting a distance threshold $\Delta d_1$ (500 m), we cluster CDRs within the threshold to their center point (the point with the smallest sum of distances to the other points), and we calculate the time difference between the earliest and latest records as the user’s dwelling time at this cluster point. After this process, $D_i$ is transformed into a new sequence $D_i' = (d_i'(1), d_i'(2), \ldots, d_i'(n'))$, where $d_i'(k) = (t(k), dur(k), x(k), y(k))$ is a 4-tuple recording the arrival time, dwelling time, and clustered coordinates of the $k$-th record.
Then we apply clustering in the spatial dimension to filter out outlier locations. Specifically, we set $\Delta d_2$ (500 m) as the distance threshold to further cluster locations in the CDR sequence. Here we only merge spatially close locations and, after spatial clustering, delete records whose dwelling time is less than $\Delta t$ (10 min in this paper). We then obtain the final stay behaviors $S_i = (s_i(1), s_i(2), \ldots, s_i(l))$, where $s_i(k) = (t(k), dur(k), x(k), y(k))$. In this way, we filter out the locations that users merely pass by and retain the long stays that are useful for downstream modeling tasks.
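The two clustering passes can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of Jiang et al. [16] itself: a run of consecutive records is closed as soon as a record lies more than 500 m from the first record of the run, spatial merging snaps each cluster to the first earlier location within 500 m, and all names (`Stay`, `detect_stays`, `haversine_m`, `medoid`) are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Stay:
    arrive: float  # arrival time in seconds
    dwell: float   # dwelling time in seconds
    lat: float
    lon: float

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def medoid(points):
    """Point with the smallest sum of distances to the other points."""
    return min(points, key=lambda p: sum(haversine_m(p[0], p[1], q[0], q[1]) for q in points))

def detect_stays(records, dist_m=500.0, min_dwell_s=600.0):
    """records: list of (timestamp_s, lat, lon) tuples sorted by time."""
    # Pass 1 (temporal): group consecutive records that stay within dist_m
    # of the first record of the current run (a simplifying assumption).
    runs, current = [], []
    for rec in records:
        if current and haversine_m(current[0][1], current[0][2], rec[1], rec[2]) > dist_m:
            runs.append(current)
            current = []
        current.append(rec)
    if current:
        runs.append(current)
    stays = []
    for run in runs:
        lat, lon = medoid([(r[1], r[2]) for r in run])
        stays.append(Stay(run[0][0], run[-1][0] - run[0][0], lat, lon))
    # Pass 2 (spatial): snap each cluster to the first earlier location within dist_m.
    anchors = []
    for s in stays:
        for a in anchors:
            if haversine_m(a[0], a[1], s.lat, s.lon) <= dist_m:
                s.lat, s.lon = a
                break
        else:
            anchors.append((s.lat, s.lon))
    # Merge consecutive clusters that now share a location.
    merged = []
    for s in stays:
        if merged and (merged[-1].lat, merged[-1].lon) == (s.lat, s.lon):
            merged[-1].dwell = s.arrive + s.dwell - merged[-1].arrive
        else:
            merged.append(s)
    # Drop pass-by locations with a dwelling time below the threshold (10 min).
    return [s for s in merged if s.dwell >= min_dwell_s]
```

A single record between two longer stays has a dwelling time of zero and is therefore removed as a pass-by location.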
After generating users’ daily stay locations from their CDR data, we further identify the location of each user’s residence. On the one hand, the home location allows us to calculate the residence-based pollutant exposure. On the other hand, it is important to know where the home lies in the mobility trajectory, because the environment and land use around users’ residences affect their daily travel and activities and thus their mobility patterns. Assuming that most users go out during the day on weekdays and return home from their workplaces at night, we define the location with the highest frequency of visits on weekday nights and all day on weekends as the user’s home location. If a user’s total number of visits to ‘home’, as determined by the above rule, is less than 10, we regard the user as a short-term visitor to the city and delete their records. Detailed results of this subsection can be found in Section 3.2.
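The home-identification rule can be sketched as below. The text does not specify which hours count as night, so the 19:00 to 07:00 window is an assumption, as are the function names and the representation of a stay as an (arrival time, location id) pair.

```python
from collections import Counter
from datetime import datetime

NIGHT_START, NIGHT_END = 19, 7  # assumed night window; not given in the text
MIN_HOME_VISITS = 10            # users below this count are treated as visitors

def is_home_window(ts: datetime) -> bool:
    """True on weekends (all day) and on weekday nights."""
    if ts.weekday() >= 5:  # Saturday or Sunday
        return True
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END

def identify_home(stays):
    """stays: iterable of (arrival datetime, location id) pairs.

    Returns the most frequently visited location within the home window,
    or None if it was visited fewer than MIN_HOME_VISITS times.
    """
    counts = Counter(loc for ts, loc in stays if is_home_window(ts))
    if not counts:
        return None
    loc, n = counts.most_common(1)[0]
    return loc if n >= MIN_HOME_VISITS else None
```

Returning `None` corresponds to discarding the user as a short-term visitor.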
2.4. Calculating Individual Exposure
As mentioned above, the essence of quantitatively describing exposure is to focus on the concentration of pollutants and the duration of contact [1]. Duan [32] provided a method to calculate exposure by linearly combining concentration and dwelling time. In this paper, we implement two methods to quantitatively calculate individual exposure: $E^{\mathrm{home}}$ represents the exposure calculated solely based on the home location of an individual, and $E^{\mathrm{mob}}$ represents the exposure calculated based on the mobility behavior of an individual. As the traditional method of estimating exposure based on residence, $E^{\mathrm{home}}$ uses only the location of users’ homes and does not take account of their mobility behavior. The mobility-based method, $E^{\mathrm{mob}}$, takes into account people’s stays at various locations during the day and evaluates the contribution of each stay to the exposure. With $C_{g}^{m,d,h}$ denoting the hourly PM concentration of grid $g$ (Section 2.2), the two exposure metrics of user $u$ in hour $h$ of day $d$ in month $m$ can be expressed as:
$$E_{u}^{\mathrm{home}}(m,d,h) = C_{g_u^{\mathrm{home}}}^{m,d,h} \times 1\,\mathrm{h}$$
$$E_{u}^{\mathrm{mob}}(m,d,h) = \sum_{g \in G_u^{m,d,h}} C_{g}^{m,d,h} \cdot \frac{t_{u,g}^{m,d,h}}{3600}$$
Here, $g_u^{\mathrm{home}}$ represents the grid where user $u$ lives, $g \in G_u^{m,d,h}$ represents a grid where user $u$ stays in their trajectory, $t_{u,g}^{m,d,h}$ represents the stay time, recorded in seconds, of user $u$ in grid $g$ within hour $h$ of day $d$ of month $m$, and the unit of the final PM exposure is μg·h/m³.
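As a small worked illustration of the two metrics, the sketch below computes both for a single hour. It assumes that the residence-based metric treats the full hour as spent in the home grid and that the mobility-based metric weights each grid's hourly concentration by the stay time in seconds divided by 3600; the function names and the grid identifiers are hypothetical.

```python
def home_exposure(conc_home):
    """Residence-based exposure for one hour, in µg·h/m³.

    Assumes the whole hour is spent in the home grid.
    """
    return conc_home * 1.0

def mobility_exposure(stays, conc):
    """Mobility-based exposure for one hour, in µg·h/m³.

    stays: list of (grid id, stay time in seconds within the hour).
    conc:  dict mapping grid id to the hourly concentration in µg/m³.
    """
    return sum(conc[grid] * seconds / 3600.0 for grid, seconds in stays)

# Example: half an hour at home (60 µg/m³) and half an hour at work (40 µg/m³).
conc = {"home": 60.0, "work": 40.0}
stays = [("home", 1800), ("work", 1800)]
e_home = home_exposure(conc["home"])   # 60.0
e_mob = mobility_exposure(stays, conc) # 50.0
```

In this example the residence-based value overstates the exposure because the user spends half of the hour in a cleaner grid, which is the kind of discrepancy the comparison of the two metrics is meant to reveal.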
4. Discussion
In our experiments, we calculated the exposure of residents at the individual level. In addition to calculating the average exposure of all residents, we analyzed individual exposure in the temporal and spatial dimensions and compared the results obtained from the two methods of calculating exposure. In Figure 10, we show not only the discrepancy between the two exposure estimation methods but also the geographical environment around Shanghai. Additionally, we collected data on population, industry, and urban construction for each administrative district in Shanghai for the year 2014 from the official website of the Shanghai Bureau of Statistics (https://tjj.sh.gov.cn/tjnj/20170629/0014-1000201.html accessed on 9 December 2023). These data are shown in Figure 11.
The result in Figure 10 indicates that the exposure calculated based on stay behavior is slightly higher than that calculated based on residence in eastern Shanghai, whereas the opposite holds in western Shanghai. The spatial distribution of pollutant concentration in Shanghai is the main reason for this result. Figure 11a shows that the density of industrial enterprises in the eastern region of Shanghai is relatively low, and Figure 11c reveals a high proportion of green space in the same area. Moreover, according to the geographical location of Shanghai shown in Figure 10, the eastern part of Shanghai is near the sea and can benefit from sea breezes. These factors help explain the west-high, east-low trend in the spatial distribution of PM2.5 concentration in Shanghai shown in Figure 2.
Due to this pollutant distribution, our PM exposure estimates consistently show that residents living in the eastern part of Shanghai experience lower levels of PM exposure than those living in the western part, irrespective of whether their spatial movement is considered. Residents of the Pudong New Area have the lowest average PM exposure per hour, and residents of the Jinshan District have the highest. Although mobility can somewhat reduce the effect of the residential environment on individual PM exposure, bringing it closer to the overall average, our experimental results show that individual PM exposure in Shanghai is still highly correlated with the residential environment: if the PM concentration around a user’s residence is high, their overall exposure to PM is high, and if it is low, their exposure is low.
Specifically, from the temporal perspective, our experimental results indicate that individuals have relatively high exposure to PM2.5 during the morning and evening rush hours. Therefore, the government can introduce policies that encourage green commuting, such as moderate vehicle restrictions and the construction of more bicycle lanes. In addition, exposure levels in January are significantly higher than in February, March, and April, highlighting the severity of pollution in winter. From the perspective of environmental sustainability, it is therefore necessary to further promote winter pollution prevention and control initiatives. For example, government departments can actively encourage residents to use renewable energy sources such as biomass, solar, and geothermal energy for heating, so as to reduce coal burning. From the spatial perspective, as introduced above, the PM2.5 exposure levels of residents in Shanghai display a distinct pattern of higher in the west and lower in the east. One reason is that the pollutant concentration is relatively higher in the western part of Shanghai. Additionally, as shown in Figure 11b, the proportion of residents engaged in industrial production is higher in the western part of Shanghai. These residents are exposed to higher levels of pollutants during their daily work, which also contributes to the higher average exposure level in the west. Therefore, policy-making should prioritize pollution control in the western regions of Shanghai, reducing the generation of pollutants such as PM2.5 at the source. Residents living in the western region can also consider equipping their homes with air purifiers to mitigate the health impacts of pollution exposure.
Furthermore, we noticed a limitation during the collection of concentration data from the fixed monitoring stations in Shanghai. These stations are mainly concentrated in the city center, while the suburban areas lack monitoring stations. This imbalance may affect the accuracy of the overall study results. We therefore believe that cities should pay attention to spatial balance when siting pollutant monitoring stations. Pollutant concentrations in urban centers are highly variable and higher on average, which is why monitoring stations are more concentrated there. However, monitoring pollutants in suburban areas can help researchers better study the overall distribution of pollutants and the exposure of residents. Additional monitoring stations in suburban areas can thus facilitate air pollution research and provide more reliable health guidance to residents living in the suburbs. At the same time, low-cost air quality sensors [33,34] might offer a practical solution to the uneven spatial distribution of monitoring stations.
5. Conclusions
In this paper, we first apply a clustering method to recognize users’ stay behavior from their CDR data and propose an approach to estimate high-resolution PM concentration for every 1-km grid in every 1-h slot. We then propose a big data analysis framework for individual exposure estimation, which is the main contribution of this paper. This framework enables large-scale estimation of individual exposure based on users’ stay behaviors and high-resolution PM concentration data.
For future work, we believe it is possible to further differentiate user dwell behavior, calculate different exposure levels for indoor and outdoor spaces, and even assess how wearing a mask affects exposure estimation. Additionally, the precision of individual exposure estimation could be further improved by calculating exposure during an individual’s trips based on detailed travel behavior. However, these efforts require finer-grained user behavioral data and detailed trajectory data. In summary, this paper proposed a novel individual exposure estimation framework, offering fresh viewpoints and supporting data to guide the development of environmental policies for mitigating individual-level pollutant exposure.