1. Introduction
The current dynamic nature of cities has generated substantial changes in urban environments, making the analysis of geo-spatial information more complex [1]. The urban management and planning of cities has been supported for years by Geographic Information Systems (GIS) tools, since they allow storing, modeling, and analyzing geo-spatial data. On the other hand, urban planning faces challenges such as organizational changes, staff availability, updating of resources, and data availability [2]. In this context, land use maps are among the resources most demanded by authorities and researchers for conducting urban planning and environmental sustainability studies [3]. However, the production of these maps is costly and time-consuming [4]. Consequently, the production and management of land use maps pose a technical problem [5], caused by economic constraints, mostly in developing countries.
The Internet generates large volumes of data at a high rate, in particular posts on social networks. Social networks have become a source of data for researchers, because these data contain information that allows studying the interaction of citizens with their environment [6]. The concept of “user-generated content” [7,8], which encompasses how citizens share excerpts from their daily lives, can be applied to different domains: the representation of urban boundaries, relationships between weather conditions and traffic, citizens’ activity patterns, urban transportation behavior and land use [6,9], urban planning, and spatial planning. Therefore, social networks are a valuable source of data that could be used to identify interesting aspects of a city’s land use.
Although social network data contain numerous semantic distortions and are not intended to be a source of geo-spatial information, the text of posts carries valuable clues about how people interact with their space and can complement other sources in the process of identifying land uses [10]. These data can comprise various fields, such as message text, location, and images. In this context, various approaches have been developed and applied to analyze this data type, with the semantic analysis of the message text—unstructured information—being the basis of much of the research related to social networks. However, it remains an underexploited resource for obtaining and analyzing geographic information [11]. Current land use identification and analysis studies rely primarily on publication metadata, such as time and geographical coordinates—structured information.
The various studies on land use classification have rarely considered the content of posts as a source of data to infer the type of use. In the content of social network posts, people communicate attitudes, activities, relationships, and other emotional states that are often complemented with a geo-spatial location. In written expressions it is possible to find words that allow inferring information about where a user is situated. For example, if a person posts the following message “having lunch with friends” on a social network, we could assume that they are in a restaurant, a house, a shopping mall, or another place where they can have lunch. In addition, if we associate two or more words such as “having lunch with friends in a restaurant”, the connector “in” and the word “restaurant” provide us with a high probability that the place where he/she is located is, in fact, a restaurant. If the coordinates associated with the message are included, we can know the space where the restaurant is placed, and consequently, the type of land use to which that space belongs. Therefore, the potential of the content of social network publications to address the study of land use is remarkable.
Nevertheless, the language used in social network publications is very informal and includes many idioms, region-specific expressions, misspellings, and other language deformations. For this reason, processing this type of information for specific objectives becomes an important challenge [12], in which numerous algorithmic techniques must be combined to form a methodology capable of extracting the valuable knowledge.
Most studies on land use are oriented to classification tasks. When a data source such as the expressions of a language is used, it is necessary to propose a methodology that starts with data collection, followed by pre-processing and, finally, classification. The fundamental reason is that the expressions in each language and country—and even region—are different. For example, in order to carry out the present study, it was necessary to create a corpus and a dictionary for the Spanish language and for expressions specific to Peru.
This paper proposes a methodology for the identification of land uses, applying Natural Language Processing (NLP) to the contents of the popular social network Twitter. The task is approached by identifying keywords and linguistic patterns in the text, together with the geographical coordinates associated with each publication. The methodology is composed of the following stages: data collection, corpus and dictionary creation, pre-processing, feature extraction, learning, prediction, and representation. At each stage, context-specific innovations are introduced to deal with data from South America and, in particular, the city of Arequipa, Peru. The objective is to identify the five main land uses: residential, commercial, institutional-governmental, industrial-offices, and unbuilt land.
Within the framework of urban planning and sustainable urban management, the methodology proposed in this study contributes to the optimization of the identification techniques applied for updating land use cadastres, since the results achieve an accuracy of about 90%, which motivates its application in a real context. In addition, this model would allow the identification of land use categories at a more detailed level, for instance in buildings with a complex or mixed distribution of uses, depending on the amount of data collected. Finally, the methodology makes land use information available in a more up-to-date and, above all, much less costly way.
The document is organized as follows: first, a review of the recent scientific literature related to the use of machine learning techniques applied to the categorization of land use, and specifically involving data from social networks, followed by an overview of the NLP-based methodology; next, a description of the application context in the city of Arequipa, containing examples of the results provided by the main phases of the methodology, together with the results achieved in comparison to several models; then, visual examples are presented to illustrate the contrast between results from our approach and what is registered in the cadastre; finally, conclusions and future work.
3. Materials and Methods
The area of study is the city of Arequipa, which is the second most populated urban area in Peru, with more than a million inhabitants and a yearly growth rate of 2.3%, according to the last census, conducted in 2017. This study covers the historical center of Arequipa (see Figure 1), with an approximate area of 3.46 square kilometers and about 2251 properties located within 56 blocks. The categorization is based on the real estate properties declared in the area.
The Master Plan for Arequipa’s Historic Center and Surrounding Area (PlaMCha 2017–2027) describes the geographical characteristics of the historical center, which has three defined zones: the old zone, the monumental zone, and the buffer zone, located at the coordinates 16°23′53.33″ S, 71°32′12.67″ W [31]. The oldest area, where the main monuments and the Main Square of the city are located, is called Damero (transl. checkerboard); it is constituted by blocks of 111.4 m per side separated by streets of 10.3 m, which gives the characteristic unitary image, and covers an area of 1.41 km² (141 ha). The Monumental Zone covers 2.12 km² (212 ha), and the Buffer Zone covers 3.46 km² (346 ha) and encompasses the first two zones.
The historical center was selected because it brings together different types of land use such as historical, artistic, cultural, workplaces, residential, commercial, and public spaces. The large influx of tourists and residents in this area makes it possible to collect a large amount of data from social networks, unlike in other parts of the city.
3.1. Data Sources
3.1.1. Types of Land Use
The land use map of the historical center of Arequipa, made by the PlaMCha, defines 14 categories of land use (see Table 1—left). However, in this work, the re-categorization carried out by [13] is considered, which initially takes into account 48 categories of land use and then groups them into 5: residential, commercial, industrial-offices, institutional-government, and unbuilt land (see Table 1—right). The residential category groups houses, apartments, and condominiums. The commercial category comprises properties that carry out typical activities of stores, bookstores, shopping centers, restaurants, bars, hotels, lodgings, parking lots, and entertainment venues. The industrial-offices category groups industrial, company, and office land uses. The institutional-government category includes buildings related to education (institutes, academies, schools, and universities), health (hospitals and clinics), culture (cultural centers and museums), management (administrative, financial, or governmental grounds), and religious centers. Finally, the category of unbuilt land is mostly composed of agricultural land (crops), vacant land (buildings that do not exceed 1% of the surface), and the course of the Chili river, which runs north–southwest through the historical area.
3.1.2. Cartographic Map
This study focuses on determining the categories of land use in the historical center of Arequipa; hence, it is necessary to define the boundaries and characteristics of the area in terms of polygons. The information is obtained from the documents developed in the PlaMCha and the cadastral maps of the historical center developed by the Technical Team of the Municipal Planning Institute in the Pilot Project named “Altura para la Cultura” (transl. Height for Culture) [31]. Cadastral maps were provided as GIS data, representing sets of polygons for blocks and lots in the historical center (see Figure 1).
3.1.3. Geo-Tagged Twitter Publications
Tweets allow the categorization of land use in the places where they were generated. A tweet is a short message of at most 280 characters composed of text, emojis, and attachments, published on the Twitter platform, which is considered one of the largest sources of information fed by millions of users [32]. Twitter Application Programming Interfaces (APIs) allow capturing tweets encoded in JavaScript Object Notation (JSON) format with their associated attributes and values. A tweet can have around 150 associated attributes, although this research only requires the ID, user ID, text, timestamp, geodata, and language of the message. All collected tweets contained a precise location (latitude and longitude).
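As an illustration, the sketch below reduces a raw tweet to the handful of attributes used in this research; the field names follow the standard v1.1 Tweet object, while the sample record and its values are invented.

```python
# Minimal sketch: reduce a raw v1.1 tweet (JSON) to the fields used in this study.
# The sample record is invented; real tweets carry around 150 attributes.
import json

raw = json.loads("""
{
  "id": 1234567890,
  "user": {"id": 987654321},
  "text": "almorzando con amigos en un restaurante",
  "created_at": "Sat May 04 18:21:07 +0000 2019",
  "coordinates": {"type": "Point", "coordinates": [-71.5369, -16.3989]},
  "lang": "es"
}
""")

def reduce_tweet(tweet: dict) -> dict:
    """Keep only ID, user ID, text, timestamp, geodata, and language."""
    lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: [lon, lat]
    return {
        "tweet_id": tweet["id"],
        "user_id": tweet["user"]["id"],
        "text": tweet["text"],
        "timestamp": tweet["created_at"],
        "longitude": lon,
        "latitude": lat,
        "lang": tweet["lang"],
    }

print(reduce_tweet(raw))
```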
3.2. Experimentation
The methodology is divided into a sequence of tasks, which begin with data collection, followed by splitting the data for building the corpus and for validation. The validation data go through a preliminary classification using PoS patterns before entering the classification algorithm. For clearer interpretation, the results of the tagged data are presented on a map of the study area according to the coordinates of each tweet. Finally, quality metrics are reported to validate the approach. The methodology is graphically illustrated in Figure 2.
Data Collection
In this phase, the model connects to the Twitter Streaming API using the Tweepy library to capture geo-referenced tweets across South America. The Streaming API has certain limitations, but it allows downloading 100% of the tweets that meet the defined filter, as long as they amount to less than 1% of the global volume of publications at a given time [33]. The application was run on a server for a period of 10 months (from April 2019 to February 2020—up to the beginning of the pandemic) and the data were stored in a MySQL database.
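A minimal sketch of this capture step is shown below, using the Tweepy 3.x StreamListener interface that was current during the collection period; the credentials and the South America bounding box are placeholders, and persistence to MySQL is reduced to a print statement.

```python
# Sketch of the capture step (Tweepy 3.x style); credentials and the bounding
# box are placeholders, and storage to MySQL is omitted for brevity.
import tweepy

CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

# Approximate bounding box covering South America: [west, south, east, north].
SOUTH_AMERICA_BBOX = [-82.0, -56.0, -34.0, 13.0]

class GeoTweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets that carry an exact point location (GeoJSON [lon, lat]).
        if status.coordinates is not None:
            lon, lat = status.coordinates["coordinates"]
            print(status.id, lat, lon, status.lang, status.text[:50])
            # In the real pipeline the record would be inserted into MySQL here.

    def on_error(self, status_code):
        # Returning False on HTTP 420 (rate limited) disconnects the stream.
        return status_code != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth=auth, listener=GeoTweetListener())
stream.filter(locations=SOUTH_AMERICA_BBOX)
```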
A corpus related to land use categorization was not available in Spanish, so it was necessary to build one. The collected data were divided into two sets: the first set consists of 3870 tweets located in the historical center of Arequipa, which are later used to identify the type of land use; the second set, used for the creation of the corpus, consists of 42,318 tweets located anywhere in South America.
These tweets were first processed to remove duplicates, blank messages, single-word messages, and messages containing only numbers. As a result, we obtained a total of 24,995 messages that were semi-automatically categorized into a land use type according to their content and geographical coordinates, leaving a total of 4538 useful tweets in different languages, which were randomly split into training and test sets. The result was distributed into the defined categories and divided into subcategories to avoid data imbalance (see Table 2).
Semi-automatic categorization consisted of two phases: automatic categorization using the Freeling library (successful for about 73% of the tweets), and manual categorization of the remaining tweets, examining the presence of keywords (restaurant, school, park, house, etc.) along with the occurrence of specific sequences of tags (e.g., verb-preposition-noun).
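The manual phase followed heuristics of this kind; the sketch below is an illustrative, simplified version, where the keyword list and the tag sequence are reduced examples rather than the full dictionaries used in the study.

```python
# Illustrative sketch of the pre-labeling heuristics: keyword lookup combined
# with a verb-preposition-noun tag sequence check. The keyword list and tag
# labels are reduced examples, not the full dictionaries used in the study.
KEYWORDS = {
    "restaurante": "commercial", "bar": "commercial", "hotel": "commercial",
    "universidad": "institutional-government", "colegio": "institutional-government",
    "casa": "residential", "departamento": "residential",
    "parque": "unbuilt land", "oficina": "industrial-offices",
}
LOCATION_PATTERN = ("VERB", "ADP", "NOUN")  # verb-preposition-noun

def suggest_category(tokens, pos_tags):
    """Return a candidate land use label, or None if no cue is found."""
    has_pattern = any(
        tuple(pos_tags[i:i + 3]) == LOCATION_PATTERN
        for i in range(len(pos_tags) - 2)
    )
    if not has_pattern:
        return None
    for token in tokens:
        if token.lower() in KEYWORDS:
            return KEYWORDS[token.lower()]
    return None

# Example: "almorzando en restaurante" -> VERB ADP NOUN -> "commercial"
print(suggest_category(["almorzando", "en", "restaurante"], ["VERB", "ADP", "NOUN"]))
```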
3.3. Pre-Processing
The pre-processing is divided into two phases: the first phase consists of cleaning the text to remove noise and correct the data, and the second phase is the data filtering to identify only the tweets that fall within a block of the historical center. The tweets to be used as classifier input go through both pre-processing phases, while the corpus tweets only go through the first phase. The steps are detailed as follows:
Noise removal, language detection, and Spanish translation. URLs, symbols, and HTML tags are removed [23,34]. For detection and translation, the Googletrans library for Python is used, which allows processing large numbers of records without a query limit; the whole text is translated at once rather than word by word, so the result is a meaningful sentence, as produced by the Google Translate Ajax API. Hashtags and mentions are excluded from the translation.
Hashtag processing. Hashtags are words that tag a tweet to a topic that is generally related to its content [35]. In this study, these words are preserved and translated individually if they are in English.
Elimination of punctuation marks and replacement of mentions. All punctuation marks are removed, except for the symbol @, which is replaced by the word in, which is useful in the context of location extraction from text [8]. However, this action requires additional processing with a dictionary, built from frequent accounts in the corpus and others specific to the region, so that this substitution adds value to the classification.
An example of the result of applying the first three techniques is shown in Table 3; a simplified sketch of these cleaning steps is given below.
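The sketch assumes regular expressions for noise removal; the translation step via Googletrans is omitted because it requires a network call, and the mention dictionary is reduced to a single invented entry.

```python
# Simplified sketch of the first three cleaning steps: noise removal,
# hashtag preservation, and replacement of mentions by "in".
# (The Googletrans translation step is omitted; it needs a network call.)
import re

# Reduced, invented example of the dictionary of frequent/regional accounts.
MENTION_DICTIONARY = {"@aeropuertoAQP": "in aeropuerto"}

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)             # remove HTML tags
    text = re.sub(r"#(\w+)", r"\1", text)            # keep hashtag words, drop '#'
    for account, replacement in MENTION_DICTIONARY.items():
        text = text.replace(account, replacement)    # known accounts -> "in <place>"
    text = re.sub(r"@(\w+)", r"in \1", text)         # remaining mentions: '@' -> "in"
    text = re.sub(r"[^\w\s]", " ", text)              # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Almorzando con amigos @aeropuertoAQP #Arequipa https://t.co/xyz"))
```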
- 1.
Processing of Abbreviations, Acronyms, Slang, and Establishment Names. The most recurrent abbreviations and slang terms within the corpus are identified and stored for the construction of a dictionary. The names of establishments, institutions, and acronyms identified from Google Maps are added to this list, which includes bars, pharmacies, hotels, universities, etc., located within the Historical Center of Arequipa. The resulting dictionary is used to identify these words in the text of the publication and replace them with the related word [27]. The result is shown in Table 4.
- 2.
Spell Checking. The open-source tool Hunspell is used (embedded in LibreOffice, Mozilla Firefox, Thunderbird, Google Chrome, and some proprietary software packages). It is a spell checker designed for languages with complex morphology, such as Spanish [34]. The concept of Longest Common Subsequence (LCS) is applied to each of the options proposed by the spell checker to improve the result. Hunspell checks that each token in a publication is a valid word in the language; otherwise, it is replaced if the LCS value of the word proposed by the checker is not less than 71%; if it is lower, the word is deleted (see Table 5). A sketch of this decision rule is given after this list.
- 3.
Stopwords. All stopwords are removed, except the words “en” (transl. in) and “de” (transl. from) because these are spatial indicators that help identify that a user is posting the tweet from a location.
- 4.
Lemmatization and PoS Tagging. All publications are pre-checked and cleaned before lemmatization for better results. We used the Freeling tool, a library providing language analysis functionalities (morphological analysis, named entity detection, PoS tagging, parsing, etc.) for a variety of languages [36]. The results are shown in Table 6.
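A minimal sketch of the spell-checking rule from step 2 is given below; it assumes the pyhunspell bindings with Spanish dictionary files installed (the paths are placeholders) and keeps a suggestion only when its normalized LCS with the original token reaches the 71% threshold.

```python
# Sketch of the LCS-based spell-checking rule (step 2). Assumes the pyhunspell
# package and Spanish Hunspell dictionaries are installed; paths are placeholders.
import hunspell

checker = hunspell.HunSpell("/usr/share/hunspell/es_PE.dic",
                            "/usr/share/hunspell/es_PE.aff")

def lcs_length(a: str, b: str) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if ca == cb \
                else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def correct_token(token: str, threshold: float = 0.71):
    """Return the token, its best acceptable correction, or None (discard)."""
    if checker.spell(token):
        return token
    for suggestion in checker.suggest(token):
        ratio = lcs_length(token.lower(), suggestion.lower()) / max(len(token), len(suggestion))
        if ratio >= threshold:
            return suggestion
    return None  # no suggestion is close enough, so the word is deleted

print(correct_token("restorante"))  # e.g., "restaurante" if suggested by the dictionary
```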
After pre-processing, the tweets belonging to the application data are selected according to their geolocation (South America). Each tweet positioned within the radius of the city of Arequipa goes through two conditions: the first condition selects the tweets that are located within the polygon that represents the historical center, and the second condition selects the tweets that are located within a polygon that represents a block. To perform the first filtering, the shapefile containing the representation of the historical center is imported into the PostgreSQL database and the tweets geo-positioned in Arequipa (ID, latitude, and longitude) are loaded into a tweet_gps table. To select the tweets located in the polygon, the coordinates are transformed into geometry-type data with the function st_geomfromtext. The SQL query used returned a total of 2343 tweets, distributed in the plane as shown in Figure 3. For the second filter, the polygons of the blocks are stored in the PostgreSQL database as geometry-type data. Tweets that are located in the middle of the street or outside the historic center are removed, leaving 924 tweets (see a zoomed area in Figure 4).
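A sketch of the two spatial filters is given below; it assumes the PostGIS extension is enabled and uses hypothetical table and column names (tweet_gps, historical_center, blocks), so the actual query used in the study may differ.

```python
# Sketch of the two spatial filters with PostGIS. Table and column names
# (tweet_gps, historical_center, blocks) and connection parameters are
# illustrative assumptions.
import psycopg2

conn = psycopg2.connect(dbname="landuse", user="postgres", password="...", host="localhost")
cur = conn.cursor()

# Filter 1: tweets inside the polygon of the historical center.
cur.execute("""
    SELECT t.id, t.latitude, t.longitude
    FROM tweet_gps t, historical_center h
    WHERE ST_Contains(
        h.geom,
        ST_GeomFromText('POINT(' || t.longitude || ' ' || t.latitude || ')', 4326)
    );
""")
inside_center = cur.fetchall()

# Filter 2: keep only tweets that fall inside a block polygon
# (tweets located in the middle of a street are discarded).
cur.execute("""
    SELECT t.id, b.block_id
    FROM tweet_gps t
    JOIN blocks b ON ST_Contains(
        b.geom,
        ST_GeomFromText('POINT(' || t.longitude || ' ' || t.latitude || ')', 4326)
    );
""")
inside_blocks = cur.fetchall()
cur.close()
conn.close()
```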
3.3.1. Feature Extraction
Feature extraction deals with the transformation or addition of features to obtain a different, possibly larger and enriched feature space. Using only one feature extraction method does not guarantee the best results, according to [10]. Consequently, this research explores the use of TF-IDF values associated with a multidimensional vector, n-grams (unigrams, bigrams, and trigrams), which have been shown to work well in document classification [37], and Bag-of-PoS, which works similarly to n-grams but is based on the sequence patterns of the Part-of-Speech labels.
For example, according to [38], “Verb-Preposition” pairs are commonly followed by the name of the place where a user is located, which allows them to be used as an indicator to identify whether the user is talking about an establishment or location. Once the data are numerical, the categorization of tweets becomes a standard machine learning classification problem.
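A minimal sketch of this feature extraction with scikit-learn is shown below; the example tweets and tag sequences are invented, and the Bag-of-PoS variant is obtained by vectorizing the PoS tag sequences instead of the words.

```python
# Sketch of the feature extraction: TF-IDF over word n-grams (1-3) and an
# analogous Bag-of-PoS built from the PoS tag sequence of each tweet.
# The example tweets and tag sequences are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatized_tweets = [
    "almorzar con amigo en restaurante",
    "estudiar en universidad con compañero",
]
pos_sequences = [
    "VERB ADP NOUN ADP NOUN",
    "VERB ADP NOUN ADP NOUN",
]

# TF-IDF over unigrams, bigrams, and trigrams of words.
word_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_words = word_vectorizer.fit_transform(lemmatized_tweets)

# Bag-of-PoS: the same vectorization applied to the tag sequences.
pos_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=False)
X_pos = pos_vectorizer.fit_transform(pos_sequences)

print(X_words.shape, X_pos.shape)
print(word_vectorizer.get_feature_names_out()[:5])
```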
3.3.2. Classification
The Multinomial Naïve Bayes Classifier (MNB) is one of the most popular algorithms for social media data and text categorization [24,37]. It is based on the evaluation of the probabilities of each class. Each tweet in the corpus has associated vectors with the extracted features and their respective label of the land use type; therefore, the classifier is trained with each one of the features, evaluating them individually, and later different combinations are examined.
Unlike the corpus data, which were filtered in the labeling process to ensure that they all refer to a location, the application data require a previous step to identify which tweets refer to the user being at a location and which refer to other topics. For that purpose, using the tagged and untagged corpus data, we selected the PoS sequences with Bag-of-PoS and identified the i most frequent sequences that represent each data set, as is done in [39]. Depending on the presence or absence of the sequences of each class in the text ([it is/it is not] in a location), the tweet is classified by the MNB approach.
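As a sketch of this step, the fragment below trains a Multinomial Naïve Bayes classifier on TF-IDF n-gram features and then labels application tweets that passed the PoS-pattern gate; the training examples are invented placeholders for the real corpus, and the gate itself is assumed to have run beforehand.

```python
# Sketch of the classification step: an MNB model over TF-IDF n-grams,
# applied only to tweets that passed the location-pattern gate.
# The training examples are invented placeholders for the real corpus.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus_texts = [
    "almorzar con amigo en restaurante",
    "comprar ropa en centro comercial",
    "estudiar en universidad",
    "descansar en casa con familia",
]
corpus_labels = ["commercial", "commercial", "institutional-government", "residential"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), MultinomialNB())
model.fit(corpus_texts, corpus_labels)

application_tweets = [
    "cenar en restaurante del centro",  # passed the PoS-pattern gate
    "tomar examen en universidad",      # passed the PoS-pattern gate
]
print(model.predict(application_tweets))
```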
3.4. Results
3.4.1. Land Use Categorization
Class imbalance and expected results are some of the considerations taken into account in the selection of metrics to measure the performance of a classifier. No single metric measures the performance of a classifier well in every scenario. In this research, the classifier is of the multi-class type (many possible results that are mutually exclusive) [40], so it is also important to calculate the metrics for each class individually, and then calculate the overall classifier metrics, as proposed in [27]: accuracy, precision, recall, and F1-score. Since the classes are unbalanced (some classes have more data, so they are more likely to appear than others), the prioritized indicator is the F1-score [37]. Results for each model are shown in Table 7.
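The per-class and overall metrics can be obtained as in the sketch below; the label vectors shown are invented placeholders, not results from the study.

```python
# Sketch of the evaluation: per-class precision/recall/F1 plus overall
# accuracy and macro-averaged F1. The label vectors are invented placeholders.
from sklearn.metrics import classification_report, accuracy_score, f1_score

y_true = ["commercial", "residential", "commercial", "institutional-government"]
y_pred = ["commercial", "commercial", "commercial", "institutional-government"]

print(classification_report(y_true, y_pred, zero_division=0))
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```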
The corpus is used with two variations: one with the text in its lemmatized form and the other with the original text, without lemmatization. The values achieved with these two variations are very close to each other, but the use of the lemmatized text always provides slightly better results (see Table 7, where two rows are shown for each feature, with and without lemmatization). Furthermore, comparing the results of the models, it is clear that using all n-grams together (unigram + bigram + trigram) achieves the highest accuracy. Therefore, this model is chosen for the classification of the application data, resulting in labeled tweets to be positioned on a Land Use Map (a sample of the results achieved with real data is shown in Table 8).
Out of the 924 tweets used by the classifier, 327 were classified in the commercial category, 248 in the institutional category, 35 in residential, 14 in unbuilt land, and 8 in industrial. There were 292 tweets left that did not reach the cut-off point established for the classifier, so they were not labeled (non-classified) assuming that the text does not refer to a location.
3.4.2. Results Visualization
The tweets were also presented on the cadastre map of the historic city center. In this way, it is possible to compare the land uses identified by the classifier against the land use category registered in the cadastre by the local municipality.
The map of the historic center is shown in Figure 5 (left) with the classification of land uses according to the city’s cadastre. On the map, each category of land use is associated with a color: residential, commercial, industrial-offices, institutional-governmental, and unbuilt land are shown in yellow, red, blue, light blue, and green, respectively. The second map in Figure 5 (right) shows the tweets labeled with the categories of land use. According to the information shown on the map, a large number of land uses corresponding to the commercial, industrial, unbuilt land, and institutional-governmental categories were identified. However, only a small number of tweets with the residential label is observed, although, according to the cadastre, a significant percentage of land uses in the historic center corresponds to residential. Therefore, there is no correspondence in the residential category between the cadastre and the automatic classification of land use.
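A sketch of how such a map can be rendered with GeoPandas is given below; the shapefile name, the CSV of classified tweets, the column names, and the color mapping are assumptions for illustration.

```python
# Sketch of the map rendering with GeoPandas/Matplotlib. The shapefile name,
# the CSV of classified tweets, and the column names are illustrative assumptions.
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

COLORS = {
    "residential": "yellow", "commercial": "red", "industrial-offices": "blue",
    "institutional-government": "lightblue", "unbuilt land": "green",
}

blocks = gpd.read_file("historical_center_blocks.shp")   # cadastre polygons
tweets = pd.read_csv("classified_tweets.csv")             # id, latitude, longitude, category
tweets_gdf = gpd.GeoDataFrame(
    tweets,
    geometry=gpd.points_from_xy(tweets.longitude, tweets.latitude),
    crs="EPSG:4326",
)

fig, ax = plt.subplots(figsize=(8, 8))
blocks.to_crs("EPSG:4326").plot(ax=ax, color="whitesmoke", edgecolor="gray")
tweets_gdf.plot(ax=ax, color=tweets_gdf["category"].map(COLORS), markersize=8)
ax.set_title("Land use categories inferred from tweets")
plt.show()
```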
In order to deepen the analysis of the inconsistencies revealed between the two methods, Figure 6 and Figure 7 clearly show interesting differences. The color scale for land use is the same in both, as depicted in the legend. Figure 6 illustrates areas (light blue circle) labeled as residential in the cadastre, but identified as vacant land or institutional. On the other hand, Figure 7 shows a block (yellow) containing the Santa Catalina Convent, which is cataloged as residential in the cadastre. However, the tweets indicate potential use as unbuilt or institutional-government land (religious being a subcategory of the latter). Multiple cases of inconsistency were detected when comparing the cadastre and the classifier outcome, many of which were later verified by a site visit. In general, the results provided by our approach were much more accurate than those registered in the cadastre.
4. Conclusions
Social networks provide valuable data on urban dynamics, offering new opportunities for research in the field. In this study, we used Twitter data to analyze land use in the historical center of the city of Arequipa by capturing tweets from the area with the text of the publication, time, date, and coordinates.
This research proposes a complete NLP methodology for the analysis of tweet texts and coordinates, together with the Multinomial Naïve Bayes algorithm for the classification of spaces into land use categories. The evaluation of the model shows that the approach provides excellent results, with an accuracy of about 90% and an F1-score of about 88%. The information on the area obtained from the project “Height for Culture” is used as a basis for the interpretation of the results. As expected from knowledge of the area, a large percentage of the properties belong to the commercial category, because they are in the historical city center with a high presence of tourism, followed by the buildings of the institutional-cultural category, owing to the historical character of the area. However, the methodology detected that many residential spaces registered in the cadastre currently have other activities or uses.
We conclude that Twitter data provides useful information to identify land uses in the geographic area where it is captured. Tweets are gathered in a simple and inexpensive way and provide information that can be used as an additional method by urban planning professionals and organizations interested in the area. The advantage of this model over traditional ones resides in its dynamism, since it uses data that are constantly updated by users and makes it possible to reveal the inconsistencies that exist in the maps generated by the cadastre due to the constant change of the environment. Moreover, the methodology is easily transferred to other geographical areas by means of dictionaries specific to the region under study. This knowledge might be used as a recommendation system for short-term supervision or updating of the cadastre.
Finally, this method depends on the amount of geo-located data available at the time of classification, so the capture of publications from other social networks should be considered for future work to maximize the effectiveness and usefulness of the results.