1. Introduction
In recent years, social media has expanded dramatically in popularity around the world and has become an integral part of the information ecosystem, from both application and research perspectives, owing to its unprecedented reach (e.g., to users, consumers, businesses, governments, and nonprofit organizations) [1]. Historically, traditional data sources (e.g., surveys and censuses) [2,3] have been analyzed to observe human activity behavior [4,5], lifestyle patterns [6], and gender differences [7], but these data sources are expensive in both collection and analysis. They also require more processing time and suffer from data sparsity. Policymaking and the delivery of services are closely intertwined with city planning and human mobility. However, because of the limitations highlighted above, these traditional methods are considered less effective for policymaking and the delivery of services [8].
A considerable amount of previous research [9,10,11,12,13,14] has studied the demographics of social media users and discussed the reasons that lead people to use location-based social networks (LBSNs). These studies highlighted the motivations for social media use among both male and female users. Smith [9] argued that female users are more likely than male users to use online social media to interact with family and friends. Muscanell and Guadagno [15] reported that male users mostly use online social media to form new relationships, whereas female users use it more to maintain existing relationships. Thus, the patterns of and motivations for social media use appear to differ slightly between male and female users. Hwang and Choi [16] explored the online usage behavior of college students on Weibo and their motivations for usage by gender, and suggested that Weibo serves as a platform for searching for information on social issues and interests. Rossi and Musolesi [17] proposed methodologies to identify unique users from check-in data and characterized users by the spatiotemporal trails of their check-ins over time and the frequency of their visits to specific locations [16,18].
As part of online social interaction in LBSNs [19,20], users [21] can announce their geo-location [22], announce the activity they are performing [23], and discuss the places they visit (referred to as a “check-in” [24]). By the third quarter of 2017, Weibo [25] had 376 million monthly active users (MAU) and 172 million daily active users (DAU). Among the active users, 93% accessed Weibo through mobile devices [26,27]. This enormous number of users worldwide was attracted by fast information sharing and the check-in phenomenon [28], which generates high volumes of data (referred to as “Big Data” [29,30]). Check-in data have fundamental limitations in representing human check-in behavior [31], namely gender bias, sampling-frequency bias, and location-type bias. Nevertheless, check-ins reveal human check-in behavior in space and time. The motivations for using Weibo may differ between male and female users. Statistics show that 50.10% of Weibo users are male and 49.90% are female [32], and Weibo is considered one of the most popular social media platforms in China [33] owing to the unavailability of Facebook and Twitter. According to the China Internet Network Information Center (CNNIC) [34], 72% of all Sina Weibo users were 20 to 35 years of age. Among them, users in their 20s form the majority and constitute the heaviest users [16,35].
Currently, LBSN data are obtainable at relatively low cost with information such as timestamps, location, and gender [36], and, compared with the traditional datasets described above, can be analyzed more readily to understand human check-in behavior. LBSN data thus offer new dimensions that help create new techniques and methods for observing human check-in behavior [37] and gender differences. In the current study, we explore LBSN data to observe human check-in behavior and the intensity of check-ins over the study period within a city at an individual level. To this end, the study investigates the spatiotemporal information attached to check-ins to identify and characterize human check-in behavior. The simple hypothesis is that people follow a typical daily routine: e.g., go to work, eat at a preferred restaurant, go shopping, and go back home.
Consequently, if enough data are available to observe distinctive human behaviors, LBSN check-in information can be used as a proxy measure to understand human behavior. Recent research [38,39,40,41,42] explored LBSN datasets to examine people’s daily check-in behavior and mobility patterns in cities other than Guangzhou, China. Moreover, most of the existing literature focused on Facebook and/or Twitter rather than Weibo. Therefore, this study helps to fill a research gap by focusing on the most popular Chinese social network site, Sina Weibo, with Guangzhou, China, as the study area. Previous studies [16,18] explored gender-based check-in behavior in Sina Weibo and suggested that women are more likely to use Sina Weibo to provide help and information to others. However, studies to date have not fully investigated gender-based check-in behavior in Sina Weibo usage, especially in Guangzhou, which also motivated the current study. The primary objective and contribution of the paper are twofold and can be summarized as follows.
The primary objective of the research is to characterize behavioral differences between male and female users using the “check-in” function of Sina Weibo (launched by Sina Corporation [43] in 2009).
The main contribution of our work consists of examining check-in density using kernel density estimation (KDE). Geographically weighted regression (GWR) is applied to observe the relationship between check-ins and districts, with a focus on gender, during weekdays and weekends. Lastly, standard deviational ellipse (SDE) analysis is used to analyze the orientation, direction, spatiotemporal expansion trends, and differences in the check-in behavior of male and female users in Guangzhou, China, which was missing from the existing literature on gender-based check-in behavior analysis.
Moreover, this line of research can improve our understanding of human check-in behavior and supports treating LBSN data (a source of big data) as a supplement to, rather than a substitute for, traditional data sources when making policy decisions [44,45,46] associated with urban planning [47,48] and city functionalities [49].
The organization of the rest of the paper is as follows.
Section 2 presents the literature review.
Section 3 defines the study area and dataset.
Section 4 presents the methodology.
Section 5 presents and discusses the experimental results obtained on the dataset. Lastly,
Section 6 concludes the paper and proposes some further research issues.
2. Literature Review
Research on spatial analysis aimed at observing human behavior was long constrained by traditional data sources and has progressed significantly with improved abilities to capture, analyze, and process LBSN data [50,51]. The term “social network site” (SNS) [52] refers to a web-based service [53] built on a social structure made up of individuals connected by one or more specific types of interdependency, such as friendship, common interests, and shared knowledge [54]. An SNS allows users to (1) construct a profile, (2) articulate their social links, and (3) track and view shared social ties within the system [55,56,57]. Moreover, it reflects real-life social networks among people through online platforms such as websites, providing ways for users to share ideas, activities, events, and interests over the Internet.
SNSs first emerged in the mid-1990s [56,58,59,60] as a simple mode of communication for interacting with people over the Internet using personal computers only [61]. Recent technological advancements in “smart” mobile devices have empowered users of existing social networks in a variety of ways by adding a location dimension and by making social network accounts accessible on mobile devices as well as personal computers [21]. Initially, desktop computers were the means to use, connect, and share information on SNSs [62], but, with the introduction of smartphones, it became convenient for users to connect and share information with their “friends” [63] on the move [64,65]. With this rapid development of mobile phone technology, users can easily communicate and share information (e.g., text, audio, and video) by using their geo-location [66]. The development of LBSNs progressed with the integration of communication technologies [67], which, in turn, enables fast sharing of information about what, where, why, and with whom users share. LBSNs include geographic services (i.e., geo-location) and capabilities (i.e., geo-tagging) that assist in exploring social dynamics, which makes them an essential type of social networking [20,68]. They allow users to share their current geo-location and discover their friends’ locations, which, in turn, raises privacy concerns [69,70]. Privacy in LBSNs depends primarily on the legislative and business-oriented actors involved in data sharing, so privacy is not solely an individual issue. Yet, some personal data are shared by users, either unintentionally or willingly [71,72,73,74]. Sometimes, users deliberately share their location in exchange for benefits (e.g., customers can enter competitions, donate to charities, or earn additional loyalty credit) in return for information, branded hashtags, check-ins, or experiences on LBSNs [75].
Various studies based on LBSN datasets have observed human check-in behavior in domains such as privacy [73,76,77], gender differences [78], geographic spaces [56], urban emotions [79], activity location choice, lifestyle patterns [6,80,81,82], and operations and production management [83]. Li and Chen [63] studied location sharing by users in the real world and presented data analysis results on user profiles, update activities, mobility characteristics, social graphs, and attribute correlations. Benevenuto et al. [84] analyzed the frequency and duration of social network connectivity, as well as the types and sequences of activities that users conduct on these sites. Chang and Sun [85] analyzed an LBSN dataset to identify the factors that influence where users check in, including historical check-ins, similarity to historical places, where their friends check in, time of day, and demographics. Lei et al. [86] spatiotemporally analyzed an LBSN dataset to observe human dynamics regarding differences in gender, check-in behavior, and online time duration in Beijing’s Olympic Village, and argued that female users are more likely than male users to interact on social media. Hu and Zhang [87] utilized clustered spatiotemporal data and suggested a selection method; they also performed exploratory spatial data analysis (ESDA) on mined Weibo check-in data to obtain datasets suitable for quick grouping. Saleem et al. [88] explored prominent locations and introduced a measure of location influence that captures the ability to reach out geographically using LBSN data. They also proposed a memory-efficient algorithm that produces diverse sets of locations with a broad geographical spread in an efficient and scalable manner. In addition, previous research [89,90,91,92] focused on observing human mobility patterns and analyzing check-in data for location prediction and venue tagging in the city using LBSN datasets, whereas References [37,93] mainly examined the factors that can predict the uses and patterns of LBSN usage.
Many applications have utilized the concept of automatic venue tagging to observe spatial differences [94,95]. Gao and Liu [96], however, argued that the temporal features and ranking of a user’s geo-location history are considered irrelevant with the integration of human mobility in LBSNs. Yang et al. [97] explored check-in behavior and mobility patterns by analyzing the spatiotemporal distribution of geotagged social media messages and activity patterns. Moreover, References [92,98,99] analyzed large LBSN datasets to study the variation of urban spaces and observed the spatial characteristics of the social networks that may arise among LBSN users. Muscanell and Guadagno [15] examined the impact of gender and personality on the use of Facebook and MySpace and reported that male users use social media for relationship formation, whereas female users use it for relationship maintenance. Moreover, female users low in agreeableness were reported to use instant messaging more often than female users high in agreeableness, whereas male users low in openness were reported to play more online games than male users high in openness. Rzeszewski and Beluch [100] addressed the gap in data (representation and representativeness) by investigating LBSN users based on the spatiotemporal distribution of the content produced (demographics of the user population). Guan et al. [101] studied the concentration and significance of users’ thoughts on Sina Weibo, and Feng et al. [102] analyzed China’s city network based on users’ friend relationships and check-in behavior on Sina Weibo.
LBSN datasets have been exploited in various research studies on urban development and its environmental hazards [103], expansion and exploration [104,105,106], travel and activity patterns [107,108], disaster management [109,110,111], emergency mapping [112], special event populations [113], and urban sustainability [114]. Hong [115] highlighted various factors for observing the payment patterns and willingness of buyers by utilizing an LBSN dataset. Mazumdar et al. [116] proposed a prediction model that infers surreptitiously visited locations from an available user trajectory, and investigated how a user’s check-in data can be used to predict unchecked or hidden locations. Dokuz and Celik [117] proposed a method that uses users’ historical data and measures to discover communally important locations with respect to each individual user’s preferences. Furthermore, they proposed an algorithm and compared it with a naïve alternative using a real-life Twitter dataset. Fiorio et al. [118] developed a methodology for parsing the population-level migration signal from individual-level point-in-time data using flexible time scales. They also proposed a stochastic model for simulating patterns in digital trace data and tested it against three datasets, including geo-tagged Tweets and Gowalla check-ins. Wu et al. [119] analyzed the impact of mixed neighborhood land uses on housing prices. Using geographic information system data, they created three quantitative measures of land-use mix and computed them for various neighborhoods in Beijing’s central city. Research based on check-in behavior analytics is useful for understanding gender-based human check-in behavior, but the connection with other indicators of gender equality [120,121] is outside the scope of the current study.
3. Dataset and Study Area
The dataset mined in this study was obtained from Sina Weibo. It covers the Guangzhou area for the period between January and May 2016 and contains 852,560 check-ins from 20,634 users. Guangzhou is considered one of the most attractive destinations in China owing to its job opportunities and its population, which is heterogeneous in demographic characteristics [122], socioeconomic status [123,124], and place of origin [28]. Guangzhou, China (longitude from 112°57′ to 114°3′E and latitude from 22°26′ to 23°56′N [125]) is located on the south coast of the Pearl River Delta (PRD), with a population of 14.5 million [126] and a total area of 7434.4 km² [127]. In 2015, Guangzhou was divided into 11 districts (Baiyun, Conghua, Haizhu, Huadu, Huangpu, Liwan, Nansha, Panyu, Tianhe, Yuexiu, and Zengcheng) [128]. Six of the districts (Baiyun, Huangpu, Haizhu, Liwan, Tianhe, and Yuexiu) are regarded as the center of the city [129,130], as shown in Figure 1.
The Weibo dataset used in the current study contains, for each check-in, a unique user id and the date and time of the check-in, together with the geo-location (longitude and latitude), venue type, venue category, and gender, collected via the web or mobile applications [131]. Therefore, it is assumed that the LBSN dataset archives day-to-day activity patterns and usage behaviors toward social media, and presents spatiotemporal evidence related to the daily routines of users [114]. A typical Weibo “check-in” is represented as: check-in (1305141104 006810) = {5503767214, ####, 1305141104 006810, Fri Apr 22 09:37:03 +0800 2016, m, 113.854085, 23.527322}, where 1305141104 006810 denotes the “status_id,” 5503767214 denotes the “user_id,” #### denotes the “user_name,” Fri Apr 22 09:37:03 +0800 2016 denotes the “day, month, date, time, and year,” m denotes the “gender,” and 113.854085, 23.527322 denotes the geo-location (longitude, latitude).
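As an illustration of the record layout described above, the following minimal Python sketch parses one check-in written in the brace-delimited form of the sample. The `CheckIn` class, the `parse_check_in` function, and the `is_weekend` helper are hypothetical names introduced here for illustration; the sketch assumes the seven fields always appear in the order of the sample and that the user name contains no comma.

```python
from dataclasses import dataclass
from datetime import datetime

# Timestamp layout seen in the sample record, e.g. "Fri Apr 22 09:37:03 +0800 2016".
WEIBO_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


@dataclass
class CheckIn:
    """Hypothetical container; field names follow the description in the text."""
    user_id: str
    user_name: str
    status_id: str
    timestamp: datetime
    gender: str
    longitude: float
    latitude: float

    def is_weekend(self) -> bool:
        # datetime.weekday(): Monday is 0, so Saturday and Sunday are 5 and 6.
        return self.timestamp.weekday() >= 5


def parse_check_in(record: str) -> CheckIn:
    """Parse '{user_id, user_name, status_id, time, gender, lon, lat}'."""
    fields = [part.strip() for part in record.strip().strip("{}").split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    user_id, user_name, status_id, created_at, gender, lon, lat = fields
    return CheckIn(
        user_id=user_id,
        user_name=user_name,
        status_id=status_id,  # kept verbatim, including any internal space
        timestamp=datetime.strptime(created_at, WEIBO_TIME_FORMAT),
        gender=gender,
        longitude=float(lon),
        latitude=float(lat),
    )


sample = ("{5503767214, ####, 1305141104 006810, "
          "Fri Apr 22 09:37:03 +0800 2016, m, 113.854085, 23.527322}")
check_in = parse_check_in(sample)
print(check_in.timestamp.isoformat(), check_in.gender, check_in.is_weekend())
```

Keeping the time zone offset in the parsed timestamp makes it straightforward to split check-ins into weekday and weekend groups in local time.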
4. Methodology
In the current study, we analyzed the Weibo-based geo-location dataset (January to May 2016) from Guangzhou, China. Figure 2 presents the check-in behavior analytics framework, in which the LBSN data analysis methodology involves three stages: collection and storage, analysis, and insight. Downloading the Weibo data is the main step of the collection and storage stage. To collect check-in data, we implemented a multi-threaded crawler that accesses Sina Weibo through a Python-based Weibo API (an open interface of Sina Weibo) [132,133]. The crawler collects the check-in data filtered by gender and retains the entries that have a geo-location. The outcome is a single file in JSON (JavaScript Object Notation), which is an extensively used data format [134,135]. For analysis, the dataset is converted into a separate file in CSV (Comma-Separated Values) format so that the check-ins can be listed by their publishing time. The critical task in the data analysis stage is to mine and investigate the features of the LBSN data. During data pre-processing, invalid records are excluded according to four criteria: (a) the required information, i.e., user id, date, time, gender, and geo-location, is available; (b) the location of the record is in Guangzhou, China; (c) the record falls within the study date and time range; and (d) each user checked in at least twice a month. After pre-processing (removing noise, void records, and bogus users) of 903,008 anonymized check-in records, 852,560 check-in records associated with the geographical area between January and May 2016 were retained. Lastly, the task in the data insight stage is to analyze and investigate the features of the LBSN check-in data by considering location, time, and gender, and to visualize the data using ArcGIS [136] to produce density maps [137] and trends [138,139].
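The four exclusion criteria can be sketched in Python as follows. This is an illustrative sketch rather than the pipeline used in the study: the dictionary keys, the function names, and the use of a rectangular bounding box (taken from the longitude and latitude range quoted for Guangzhou) are assumptions, and criterion (d) is interpreted here as at least two check-ins in every month of the study period.

```python
from collections import defaultdict
from datetime import datetime

# Assumed bounding box from the range quoted in the text:
# longitude 112°57′ to 114°3′E, latitude 22°26′ to 23°56′N.
LON_MIN, LON_MAX = 112 + 57 / 60, 114 + 3 / 60
LAT_MIN, LAT_MAX = 22 + 26 / 60, 23 + 56 / 60
START, END = datetime(2016, 1, 1), datetime(2016, 6, 1)  # January to May 2016
REQUIRED = ("user_id", "timestamp", "gender", "lon", "lat")


def is_valid(record):
    """Criteria (a) to (c): complete fields, inside the box, inside the period."""
    if any(record.get(key) is None for key in REQUIRED):
        return False
    if not (LON_MIN <= record["lon"] <= LON_MAX):
        return False
    if not (LAT_MIN <= record["lat"] <= LAT_MAX):
        return False
    return START <= record["timestamp"] < END


def filter_check_ins(records, min_per_month=2):
    """Apply criteria (a) to (d) and return the retained records."""
    valid = [r for r in records if is_valid(r)]
    # Count valid check-ins per user and calendar month.
    counts = defaultdict(lambda: defaultdict(int))
    for r in valid:
        counts[r["user_id"]][r["timestamp"].month] += 1
    months = range(START.month, END.month)  # months 1 to 5
    active = {
        user for user, per_month in counts.items()
        if all(per_month.get(m, 0) >= min_per_month for m in months)
    }
    return [r for r in valid if r["user_id"] in active]
```

A bounding box is coarser than the administrative boundary of the city, so a point-in-polygon test against district boundaries would be needed for an exact version of criterion (b).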
4.1. Kernel Density Estimation
In order to detect hot-spots and observe gender differences in check-in behavior, we estimated the density function of check-ins using kernel density estimation (KDE) [140,141,142]. KDE is a popular spatiotemporal analysis technique used to observe the features of locations (e.g., destination and time) relative to each other. It has previously been used [143,144,145,146] to examine several characteristics of social media data (including, but not limited to, LBSN data), such as users’ online activity and movement patterns [38], check-in behavior [147], city boundary definitions [148,149], and point-of-interest recommendations [150]. Moreover, it examines the diffusion of destinations in neighborhoods and allows investigators to see where destinations are densely concentrated and where they are more widely dispersed. Lastly, it produces a smooth density surface of spatial point events in geographic space [151].
The goal is to produce a smooth density surface that represents the density of the point set. The algorithm works by setting a search scope (window). Starting from the central grid cell of the window, each surrounding grid cell is assigned a weight according to the principle of inverse-distance weighting, so that the weights decrease outward. The kernel density value of the central grid cell is then the sum of the weighted density values within the window.
To measure the density of historical check-ins at a point $x$, let $\hat{f}(x)$ be the density function at geo-location $x$:

$$\hat{f}(x) = \frac{1}{n h^{2}} \sum_{j=1}^{n} K\left(\frac{\lVert x - x_{j} \rVert}{h}\right) \qquad (1)$$

where $x$ represents the geo-location (longitude and latitude) at which the density is estimated from the check-ins $1 \le j \le n$ of the dataset $X$ with bandwidth $h$. In KDE, the bandwidth is an important parameter. If the bandwidth is too large, the point density surface becomes too smooth, whereas if it is too small, the point density distribution changes abruptly. Therefore, the optimal bandwidth is determined by repeatedly setting the bandwidth and comparing the smoothness of the point density surface. The resulting density estimate $\hat{f}(x)$ depends on the bandwidth $h$. Here, $n$ is the total number of check-ins in the dataset $X$, $j$ indexes a single geo-location, $K$ is a standard normal density function, $\lVert \cdot \rVert$ denotes the Euclidean norm [152,153,154], and $x_{j}$ is the geo-location of check-in $j$.
The log-probability, a data-driven criterion, is used to select the value of the bandwidth $h$ used in constructing the density estimate:

$$L(h) = \frac{1}{m} \sum_{t=1}^{m} \log \hat{f}(x^{t}) \qquad (2)$$

where the $x^{t}$ are $m$ events that are not used to construct the estimate and the $x_{j}$ are the data points in the dataset $X$. A higher value of $L(h)$ is preferable since it indicates that a higher probability is assigned to new, unseen data. Hence, a simple method for bandwidth selection is to perform a grid search on $h$ using a validation set.
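The density estimate and the grid search over the bandwidth described above can be sketched in a few lines of Python. This is a minimal illustration, not the ArcGIS procedure used in the study: the function names are hypothetical, coordinates are treated as planar, the kernel is assumed to be an isotropic two-dimensional Gaussian, and a small floor on the density is an added assumption that avoids taking the logarithm of zero for very small bandwidths.

```python
import math


def kde_density(x, points, h):
    """Gaussian kernel density at x = (lon, lat) with bandwidth h."""
    norm = len(points) * 2.0 * math.pi * h * h
    total = 0.0
    for px, py in points:
        squared_distance = (x[0] - px) ** 2 + (x[1] - py) ** 2
        total += math.exp(-squared_distance / (2.0 * h * h))
    return total / norm


def mean_log_probability(train, validation, h, floor=1e-300):
    """Average log-density of held-out points under the estimate built on train."""
    logs = [math.log(max(kde_density(v, train, h), floor)) for v in validation]
    return sum(logs) / len(logs)


def select_bandwidth(train, validation, candidates):
    """Grid search: return the candidate bandwidth with the highest L(h)."""
    return max(candidates, key=lambda h: mean_log_probability(train, validation, h))


# Toy coordinates (hypothetical values, not taken from the Weibo dataset).
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
validation = [(0.05, 0.05), (-0.05, 0.05)]
best_h = select_bandwidth(train, validation, [0.001, 0.1, 10.0])
print(best_h)
```

In this toy case the smallest bandwidth assigns almost no probability to the held-out points and the largest one oversmooths, which mirrors the trade-off described in the text.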
4.2. Geographically Weighted Regression (GWR)
GWR is a spatial regression technique that accounts for spatial nonstationarity and allows local parameters to be estimated. It is an extension of the traditional linear regression framework and is, accordingly, easy to specify. Unlike the complex mechanism of the Bayesian spatial model, the GWR method is easier for researchers to understand and is widely used in practical applications. In particular, the coefficients of the variables in GWR models can be visualized in an easily identifiable manner, which could provide insightful suggestions for city planners and check-in behavior analysis [155]. A typical GWR model takes the following form.

$$y_{i} = \beta_{0}(u_{i}, v_{i}) + \sum_{k} \beta_{k}(u_{i}, v_{i})\, x_{ik} + \varepsilon_{i} \qquad (3)$$

where $y_{i}$ is the dependent variable and $x_{ik}$ is the $k$th explanatory variable at observation $i$, $(u_{i}, v_{i})$ represents the geo-coordinates (longitude, latitude) of observation $i$, $\beta_{0}(u_{i}, v_{i})$ represents the intercept value, $\beta_{k}(u_{i}, v_{i})$ represents the estimated parameter for the $k$th variable of observation $i$, and $\varepsilon_{i}$ represents the error term.
This means that the estimated coefficients are allowed to vary in space. One assumption of GWR is that the observed data near observation $i$ have more influence in estimating $\beta_{k}(u_{i}, v_{i})$ than the data farther from $i$. The parameter vector $\hat{\beta}(u_{i}, v_{i})$ is estimated below.

$$\hat{\beta}(u_{i}, v_{i}) = \left(X^{T} W(u_{i}, v_{i})\, X\right)^{-1} X^{T} W(u_{i}, v_{i})\, Y \qquad (4)$$

where $X$ is the matrix of explanatory variables, $Y$ is the vector of the dependent variable, and the weighting matrix $W(u_{i}, v_{i})$ is a diagonal matrix whose off-diagonal elements are all zero. The estimation of GWR depends in part on the bandwidth selected for the neighbors of observation $i$. For areas with more data points, the kernel bandwidth will be smaller, whereas, for areas with few data points, it will be larger. In the current study, an appropriate bandwidth is selected by minimizing the Akaike information criterion (AIC) of the GWR model [156].
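To make the local estimation concrete, the following Python sketch fits a GWR-style regression at a single location for the special case of one explanatory variable, where the weighted least squares solution has a closed form. It is a simplified illustration under stated assumptions, not the ArcGIS implementation used in the study: the function names are hypothetical, a fixed-bandwidth Gaussian kernel is assumed for the weights, and distances are Euclidean in coordinate units.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Assumed kernel: weights decay smoothly with distance from the target."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def gwr_local_fit(target, coords, x, y, bandwidth):
    """Return (beta0, beta1) estimated at target = (u, v).

    Weighted least squares with an intercept and one explanatory variable,
    which is the single-predictor case of the matrix estimator in the text.
    """
    w = [gaussian_weight(math.dist(target, c), bandwidth) for c in coords]
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    if sxx == 0.0:
        raise ValueError("explanatory variable has no local variation")
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Toy data (hypothetical): the slope is 2 in the west and -3 in the east.
coords = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0, -2.0, -5.0, -8.0]
print(gwr_local_fit((0.3, 0.3), coords, x, y, bandwidth=1.0))
print(gwr_local_fit((10.3, 0.3), coords, x, y, bandwidth=1.0))
```

Repeating the fit at every observation location yields one coefficient per location, which is what allows the coefficients to be mapped across districts.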
4.3. Standard Deviational Ellipse (SDE) Analysis
Standard deviational ellipse (SDE) [157] analysis is often used to depict the spatial characteristics of a geographical entity, such as central tendency, dispersion, and directional trends. SDE is not only an abstract expression of an individual spatial distribution, but it also helps build more comprehensive and realistic models of human mobility and online behavior [158]. It is effective for the discrete description of anisotropic events in spatial point pattern analysis and has been widely used in research such as urban structure analysis [159]. It is therefore used in this study to analyze check-in behavior at a more detailed level.
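The SDE has four parameters: the ellipse center, the major axis, the minor axis, and the azimuth. The ellipse center is calculated according to Equation (5). The rotation azimuth is calculated according to Equation (6) and reflects the main trend direction [160,161]. The standard deviations along the major and minor axes of the SDE are calculated according to Equation (7), and the ratio of the two axes indicates the degree of flattening of the SDE. The major and minor axes of the SDE delimit the spatial region of the check-in distribution, and the direction of the major axis is defined as the dominant direction of the variation trend [162]. The ellipse center of the SDE $(\bar{X}_{w}, \bar{Y}_{w})$ is calculated below.

$$\bar{X}_{w} = \frac{\sum_{i=1}^{n} w_{i} x_{i}}{\sum_{i=1}^{n} w_{i}}, \qquad \bar{Y}_{w} = \frac{\sum_{i=1}^{n} w_{i} y_{i}}{\sum_{i=1}^{n} w_{i}} \qquad (5)$$

The azimuth $\theta$ of the SDE is calculated using the equation below.

$$\tan\theta = \frac{\left(\sum_{i=1}^{n} w_{i}^{2}\tilde{x}_{i}^{2} - \sum_{i=1}^{n} w_{i}^{2}\tilde{y}_{i}^{2}\right) + \sqrt{\left(\sum_{i=1}^{n} w_{i}^{2}\tilde{x}_{i}^{2} - \sum_{i=1}^{n} w_{i}^{2}\tilde{y}_{i}^{2}\right)^{2} + 4\left(\sum_{i=1}^{n} w_{i}^{2}\tilde{x}_{i}\tilde{y}_{i}\right)^{2}}}{2\sum_{i=1}^{n} w_{i}^{2}\tilde{x}_{i}\tilde{y}_{i}} \qquad (6)$$

The standard deviations of the ellipse $\sigma_{x}$ and $\sigma_{y}$ in the x and y directions are calculated using the formulas below.

$$\sigma_{x} = \sqrt{\frac{\sum_{i=1}^{n}\left(w_{i}\tilde{x}_{i}\cos\theta - w_{i}\tilde{y}_{i}\sin\theta\right)^{2}}{\sum_{i=1}^{n} w_{i}^{2}}}, \qquad \sigma_{y} = \sqrt{\frac{\sum_{i=1}^{n}\left(w_{i}\tilde{x}_{i}\sin\theta + w_{i}\tilde{y}_{i}\cos\theta\right)^{2}}{\sum_{i=1}^{n} w_{i}^{2}}} \qquad (7)$$

where $(x_{i}, y_{i})$ in Equations (5)–(7) represents the spatial location of object $i$, $w_{i}$ is the corresponding weight, and $(\tilde{x}_{i}, \tilde{y}_{i})$ denote the deviations of the coordinates of each object from the ellipse center of the SDE $(\bar{X}_{w}, \bar{Y}_{w})$.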
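The SDE parameters can be computed directly from a set of check-in coordinates. The Python sketch below follows the commonly used weighted formulation of the ellipse center, azimuth, and axis standard deviations; it is an illustration under stated assumptions rather than the ArcGIS tool used in the study. The function name is hypothetical, coordinates are treated as planar, the azimuth is measured clockwise from the y axis (north), and `atan2` is used so that the case of a zero denominator in the azimuth formula is handled.

```python
import math


def standard_deviational_ellipse(points, weights=None):
    """Return (center, theta, sigma_x, sigma_y) for a list of (x, y) points."""
    n = len(points)
    w = [1.0] * n if weights is None else list(weights)
    total_w = sum(w)
    # Weighted mean center of the point set.
    cx = sum(wi * px for wi, (px, _) in zip(w, points)) / total_w
    cy = sum(wi * py for wi, (_, py) in zip(w, points)) / total_w
    # Deviations of each point from the center.
    dx = [px - cx for px, _ in points]
    dy = [py - cy for _, py in points]
    sxx = sum((wi * a) ** 2 for wi, a in zip(w, dx))
    syy = sum((wi * b) ** 2 for wi, b in zip(w, dy))
    sxy = sum(wi * wi * a * b for wi, a, b in zip(w, dx, dy))
    # Azimuth: tan(theta) = (A + B) / C with A = sxx - syy and C = 2 * sxy.
    a_term = sxx - syy
    b_term = math.sqrt(a_term ** 2 + 4.0 * sxy ** 2)
    theta = math.atan2(a_term + b_term, 2.0 * sxy)
    # Standard deviations along the rotated axes.
    sum_w2 = sum(wi * wi for wi in w)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    sigma_x = math.sqrt(
        sum((wi * a * cos_t - wi * b * sin_t) ** 2 for wi, a, b in zip(w, dx, dy)) / sum_w2
    )
    sigma_y = math.sqrt(
        sum((wi * a * sin_t + wi * b * cos_t) ** 2 for wi, a, b in zip(w, dx, dy)) / sum_w2
    )
    return (cx, cy), theta, sigma_x, sigma_y


# Toy points on a north-east diagonal (hypothetical values).
center, theta, sigma_x, sigma_y = standard_deviational_ellipse(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
)
print(center, math.degrees(theta), sigma_x, sigma_y)
```

Computing the ellipse separately for male and female check-ins, and for weekdays and weekends, gives centers and orientations that can be compared across groups.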
6. Conclusions
The current study utilized density maps and trends to present a spatiotemporal investigation of gender-based human check-in behavior and to explore hourly and daily check-in patterns, as well as patterns during weekdays and weekends. The results show that, in almost all districts of Guangzhou, females are more inclined than males to use Weibo during weekdays. During the weekend, however, almost the same check-in trend is observed for both males and females. Furthermore, the center of the city has a comparatively high density of check-ins near the subway and highways.
Complementing prior research on check-in behavior, we consider LBSN data a supplement to, rather than a substitute for, traditional data sources for observing phenomena such as human mobility, activity patterns, city boundaries, and social issues in a city. Additionally, compared to traditional data sources, the LBSN dataset has some advantages (low cost and high spatial precision) and disadvantages (i.e., gender bias, low sampling frequency, and location-type bias).
Lastly, based on the results, we consider the LBSN dataset a novel source of big data with the potential to offer a new, complementary viewpoint for observing gender-based check-in density in space and time. The information from KDE can facilitate studying the dynamic evolution of check-ins across both space and time. Additionally, the KDE results verify that check-in behavior varies at fine temporal (e.g., a day) and spatial (e.g., a city) scales. The results also show that, owing to its fine temporal and spatial granularity, check-in data can reflect more refined phenomena than traditional data. Despite the differences in methodology, type of LBSN, and dataset, both earlier studies [16,18] and the current study of gender-based check-in behavior on LBSNs draw a similar conclusion: female users are more likely to use LBSNs than male users. Additionally, highlighting the check-in hotspots in the city can help policymakers define policies regarding the supply of services (e.g., transport, health, and entertainment). The SDEs indicate the difference between the trajectories of males and females in the districts of Guangzhou, and the trajectory of the SDE centers for males and females represents the differences in check-in behavior patterns in those districts.
In the future, we intend to use LBSN data to explore the activities associated with check-ins and the motivations behind those activities. Moreover, we intend to examine the factors that change check-in behavior toward those activities in space and time.