1. Introduction
In Japan, the rapid influx of population and motorization has led to the spread of urbanized areas into low-density suburbs since the 1960s [
1]. Uncontrolled suburban development, known as sprawl, leads to inefficient infrastructure provision and maintenance costs for people living in the area [
2]. Suburban sprawl is associated with a decline in the residential population and commercial functions of urban centers; the expansion of residential areas has continued in the era of depopulation that began in 2008, threatening the sustainability of the regions. Ohashi and Phelps (2021) analyzed the suburban area of the Tokyo metropolitan area and argued for in-depth policy for this area, since the shrinking of the area is more complex and heterogeneous than its expansion [
3].
In 2014, a land use plan named the “Grand Design of Japan in 2050” was announced by the Ministry of Land, Infrastructure, Transport and Tourism (MLIT) in Japan, which proposed a “compact city” concept composed of spatially agglomerated bases for daily life and a network of frequent public transport connecting the bases. Miyauchi et al. (2021) analyzed suburban area size and urban population to find a stable relationship among them [
4]. In order to mitigate the uncontrolled expansion of urban areas or reshape them into a sustainable form, many local governments have adopted the “compact city” as a goal of urban structure in their master plan for urban design, and policies were developed to encourage people to return to the city center [
5]. Such policy making requires a detailed method of understanding the current state of land use.
The development of geo-statistical data and advances in GIS technology have made it possible to analyze land use and population dynamics in fine spatial units smaller than the municipality. In Japan, national land use data supplied by MLIT provide a land use dataset at the tertiary mesh level for multiple years across the entire land. In addition, the National Census and Economic Census provide data on population, households, and number of offices and employees, also at the tertiary mesh level. These standardized geographic information databases are useful for analyzing land use change and population dynamics, and have greatly contributed to the development of research on urban structure and urban policy [
6]. However, when analyzing an entire city or region, the data handling for the number of meshes and their many attributes becomes a nuisance due to the enormous amount of information. While a map for each attribute, for example, is convenient for grasping that attribute’s characteristics at a glance, a large number of maps for individual attributes cannot give insights for the comprehensive assessment of the target area.
In order to understand changes in urban structure, it is necessary to observe the changes in land use, population dynamics, commerce, and industry in a comprehensive manner, and to extract an outline of changes and notable areas [
7]. If we could develop an analytical tool that can accomplish this task, it would be of great help for grasping the changes in urban structure corresponding to the real functions of the target area.
There are several approaches for grasping the land use characteristics of urbanized areas. Conventionally, images observed from satellites have been used for land cover monitoring. Grigorașa and Urițescu (2019) estimated land cover classes to find their relationships with land surface temperature [
8]. This empirical study in Bucharest, Romania showed a significant decrease in vegetation areas, and a negative correlation between vegetation and land surface temperature, resulting in heat island phenomena. By using detailed mesh information from a city, Rahman et al. (2021) proposed an integrated index to quantify the compactness of neighborhood districts in order to assess the need of a neighborhood policy for each district [
9]. Renne et al. (2016) estimated a model to regress network accessibility and built environment on transit commuting share in cities in the United States [
10]. Pan et al. (2018) analyzed urban structure focusing on the distance to the transportation network by using an accessibility index on small land cell units [
11]. The estimated model clarified the effectiveness of public transport access. Guan (2019) analyzed the area development around public transport stations [
12]. Xu and Yang (2019) clarified the relationship between public transport access, including transferring cost and land use characteristics, by using a geographically weighted regression model (GWR) [
13]. Dadashpoor et al. (2019) classified landscape patterns to find the longitudinal change in the relationship between land space and land use, using GWR [
14]. In this study, the drastic changes in suburban areas were clarified. Zeng et al. (2017) integrated geographical big data such as point of interest databases and OpenStreetMap with SNS big data such as Weibo to evaluate land use efficiency based on GWR [
15]. Applying their model to 40 megalopolises in China, the potential and problems of the target cities were quantitatively clarified. Flores et al. (2019) proposed a novel approach for classifying land cover into groups of surface characteristics using images with very fine resolutions [
16]. This study proposed an efficient algorithm for defining the dictionaries for the classification using convolutional neural networks. A comparison of the proposed algorithm with other deep learning methods showed improved performance. Mohamed and Worku (2019) tried to integrate land cover with land use in order to clarify growth and sprawl around a metropolitan area [
17]. Hereafter, land use refers to land attributes such as demographic, social or economic activity data, as opposed to land cover, which is mainly observed by satellite images or aerial photographs of the land surface. Their study in Addis Ababa, Ethiopia, explored land cover classification and the resulting land cover map was overlaid with a land use map to find the correspondence or lack thereof of land development with the master plan of the city. Zhang et al. (2019) proposed an integrated deep learning approach enabling the simultaneous estimation of land covering with land use [
18]. In this study, a convolutional neural network with a multi-layer perceptron was applied.
The topic model is used for a topic estimation problem from the documents with a huge number of words. Among the statistical approaches, principal component analysis or factor analysis are well-known algorithms to make reductions of the original dataset. While the principal component or factor analysis is based on eigenvalue/eigenvector decomposition, the topic model is based on singular value decomposition with stochastic and hierarchical mathematical structure. The advantage of topic model is positive estimation of parameters, while negative estimates often appears in conventional approaches. Positive estimates of parameters make much easier interpretation for topic, and the mesh-topic load matrix is conveniently used for further analysis to find the relationship with other attributes. Tsukai and Tsukano applied the topic model to geographic data of 23 attributes, such as the area of various types of land, the number of population and households, and the number of business offices, and extracted eight topics as land use types [
19]. However, the transition of land use topic was not yet analyzed.
This study purposes to clarify the applicability of topic model to geographic data, by analyzing urban structural changes such as urban hollowing out and suburbanization from the geographically extracted “topics” (hereafter referred to as geographic topics) of the target urban area. A topic model originally used for text mining from documents is developed to apply for geographical data which gives geographic topics, in this study. The proposed model enables us to assess the transition of geographical topics. The model outputs will show how the target area has sprawled (or not sprawled) and enables to give the quantitative indices about the transition of land use. The target area of our analysis is Fukuoka and Kitakyusyu metropolitan areas because these metropolitan areas are adjacent, but their land use characteristics and transition is contrastive. Fukuoka metropolitan area has been grown in population, while Kitakyusyu metropolitan area has been suffered from depopulation. In this study, we set the empirical hypothesis as follows; the proposed model can successfully capture the characteristics of the couple of metropolitan areas. If the above hypothesis is accepted through the estimated geographical topics, we can confirm the model capacity.
2. Methods
2.1. Topic Model
The topic estimation problem is to make a summary from documents consisting of a large number of words, and the summarized information is called a topic. There are many topic models that have been proposed, and the basic idea behind them is shown in the probabilistic topic generation model using Dirichlet distribution, called Latent Dirichlet Allocation (LDA) [
20]. The graphical model of topic model is shown in
Figure 1. In LDA, topics are assumed to be latent variables and a document is composed of multiple topics. The following is an overview of the model discussed in our previous study [
19]. In the following model specification, we use the identical terms with those in text analysis for which the topic model was originally developed, for ease of the readers to refer to the papers.
Consider a set of D documents. Each document d consists of words, and the n-th word of document d is . The 1-of-V representation of a lexicon in a document is a representation in which a set of vocabularies are assigned a unique number in order, and when the lexical number of the n-th occurrence of a word in each document is v, the v-th element of the vector is set to 1 and all the others are set to 0. has N × V elements in the whole document, where N is the total number of words or is the sum of the number of words in each document .
In LDA, we assume that each lexicon belongs to a latent topic
. We assume that each document has a different topic distribution
, and each topic
k has a different vocabulary distribution
. If we ignore the order in which words appear in a document, and consider only the cohesion of words that co-occur in a document, these can be expressed by Equations (1) and (2) using the multinomial distribution.
To estimate the parameters of the multinomial distribution
and
, we assume its conjugate prior distributions by the Dirichlet distribution. These are expressed in Equations (3) and (4).
We define the topic distribution of each document as the document parameters
with
D rows and
K columns. The lexical distribution of each topic is defined as the topic parameters
with K rows and V columns. The superscript
t in the right shoulder indicates transposition. The simultaneous distribution of the observed data
and the latent variables
can be expressed in Equation (5).
For the sake of model comparison, let us eliminate the latent topics
and obtain the marginal probabilities for
W (
N rows,
V columns). This is expressed in Equation (6). In addition, the 1-of-V representation of
W is rewritten with the data
M in the bag-of-words representation of the document unit defined by Equation (7).
As shown in Equation (7),
M obtained by this operation will have
D rows and
V columns, then Equation (8) is obtained.
Here,
is the
d-th row vector of matrix
. Equation (6) can then be rewritten as a probability distribution over the data
M, as shown in Equation (9).
Equation (9) is a probabilistic model of the bag-of-words data M in units of documents, which is decomposed in the products of latent parameter matrices of . The simplified notation of this structure yields Equation (10). Equation (10) also shows that LDA is a matrix decomposition model that approximates the observed data M with a low-rank matrix whose rank is the number of topics K.
2.2. Application of Topic Model into Geographical Data
In this study, we apply the topic model to geographic information data to extract geographic topics. The advantage of the topic model is its flexibility to extract topics from a set of co-occurring vocabularies. On the other hand, geographic feature in an aggregated spatial unit (i.e., a squared-kilometers unit) is characterized by a vector of attributes in terms of land use, socio-demographic information and economic activities. As described in
Section 2.1, topic model requires many documents recording the set of co-occurrence vocabularies that is expressed in 1 of
V form. If the geographic data is successfully converted or processed into the form suitable for the input to topic model, we would utilize the higher ability of the topic model to extract the geographic topic giving an adequate feature of each location. For this purpose, we consider the correspondence of data format between a set of documents and the geographic data, documents and vocabularies in text data correspond to meshes and the attributes in geographic characteristics. In the following, we consider the procedure of processing geographic information data as input to a topic model. The input data for the topic model is an
N × V matrix that counts the number of words
V in each document
N. The geographic information data is an
N×V matrix with mesh
N in the row direction and attributes
V in the column direction. In the case of geographic information data, it is an
N × V matrix with
N meshes in the row direction and
V attributes in the column direction.
The topic model is a sparse learning algorithm, but if we simply take the geographic attributes in the columns of input data, the number of attribute V is very small compared to the number of mesh N, and the data is not sparse. Therefore, we create an N × V ‘matrix data by processing the attributes with positive continuous values into a discretized classes V’. First, consider a distribution of an attribute value over all the meshes. Note that such the graph of distribution is drown by setting certain intervals of aggregation, called classes. If we introduce a dummy variable (1/0) to indicate the correspondence of the record in an attribute of a mesh to a certain class of an attribute and the dummy variables are defined for all classes and for all attributes, we will obtain the 1-of-V representation for each mesh about the distributional information of the set of attributes. By stacking over the vectors of 1-of-V representation for each mesh, we can obtain the Bag-of-Words form with rows for meshes and columns for (discretized) attributes. The above preprocessing increases the number of spatial attributes and achieve a sparse structure of the data.
The output of the topic model is an
N × K matrix Θ, where the rows represent the mesh and the columns represent the distribution of geographic topics, and a
K × V’ matrix Φ, where the rows represent the geographic topics and the columns represent the distribution of class-specific attributes, as shown in
Figure 2. The former reveals the relationship between meshes and geographic topics, and the latter reveals the dominant class-attributes composes of a geographic topic. In Θ, the higher/lower values for columns of a row show the dominant/minor discretized attribute of the geographic topic, thus a set of them gives the characteristics of a mesh. In other words, Θ can be considered as a set of factor loading vectors of each mesh. In Φ, the higher/lower values for columns of a row show the dominant/minor class-attributes that are likely to co-occur in a geographical topic. In other words, Φ can be considered as a set of class-attributes loading vectors of each topic, so it gives the interpretations of geographic topics.
2.3. Model Selection
Since Dirichlet distribution is a conjugate prior distribution of multinomial distribution, the posterior distribution of the lower level also follows multiple distribution. Therefore, the estimation of topic model can rely on the Bayesian techniques. For example, MCMC, Gibbs sampler and Variational Bayes are proposed as the parameter estimation procedure. In this study, Collapsed Variational Bayes method to find the fixed point of parameters proposed by Sato and Nakagawa (2015) is used because of its efficiency [
21]. In their algorithm, hyper parameters are also estimated by fixed point algorithm with dozens of iterations. Unfortunately, the likelihood space of topic model does not have a single summit but multiple (steep) summits. Even though its outlook is difficult, most of the previous studies reported that the estimation of topic model is acceptably stable. In our study, several trials of initial values in estimation was made, in order to check the model stability.
The model choice is made to refer log-likelihood (LL) at the conversion of calculation in Equation (11). As a non-informative initial LL, the probability for each topic
K and vocabulary
V are given by uniform probability over all the alternatives, the initial LL can be given in Equation (12).
As shown in Equation (12), initial LL of topic model depends on N and V, but is irrelevant to K. In this study, LL ratio given by initial LL and LL at the last or the converged is used to compare the models.
In LDA estimation, the number of geographical topics to be extracted,
K, has to be determined. In other words, the procedure to determine
K referring to the output is left to be answered. For the sake of higher ability to describe the input data, larger
K is preferred, while smaller
K is desired for easier handling of the data. In this study, the likelihood ratio is to give the upper limit of the range of
K. The likelihood ratio indicates the goodness of fit of the model in a statistical sense, and the number of topics with the local maximum value is the best fit as increasing of
K. However, even if the statistical goodness of fit is high, the outputs contain several similar topics, and as such, the set of topics may not be easy to interpret. Therefore,
is the upper limit of the range of consideration, and if similar topics are found, they should be reduced from the estimated topic set. As an index of topic similarity, this study uses cosine similarity between topic vectors as a measure of similarity. The similarity between topic
k and topic
k’,
S is defined in Equation (13),
where the larger the value of
’, the higher the similarity between topic
k and topic
k’. However, there is no clear basis for setting a threshold for the similarity criterion that would give an appropriate K. Therefore, we set multiple thresholds and check for the topics for which the number of similar topic pairs exceeding the threshold becomes zero for the largest
k when the number of topics is sequentially reduced from
. In this study, the values of 0.7, 0.8 and 0.9 above cos45° are used as the thresholds of similarity. Then, we extract the geographical topics corresponding with 0.7, 0.8 and 0.9 as the thresholds of similarity, respectively. Then, the extracted results are compared, and the most suitable number of topics for analysis is adopted in terms of topic interpretation.
2.4. Methodological Feature of Geographical Topic to Measure the Sprawl Phenomena
The proposed model outputs geographic topics as a set of discretized land use characteristics and the loading vectors of geographic topics for each mesh. In order to clarify the characteristics of the proposed method, the 8 indices of sprawl phenomenon proposed by Galster et al. (2001) are reviewed [
22]. The indices are density, continuity, concentration, clustering, centrality, nuclearity, mixed use and proximity. These indices are supposed to be used for mesh data, because meshes are regularly arranged in space, which simplifies their measurement. Each index, if its value is low, indicates that the mesh or a subset consisting of multiple neighboring meshes is sprawling. In our approach, density is naturally considered by making discretized attributes with its rank, so then the estimated geographic topics will show the density of land use. Mixed use would be directly measured by the loading vector of geographic topic. Proximity can be measured by mixed use for neighboring meshes. Centrality is an average distance to CBD for each attributes. Continuity is geographically gradual change of a specific attribute. In the geographical topic model, the changing of the geographical topic to the neighbor’s topic corresponds to it. Concentration, clustering and nuclearity will be measured by an extent of spatially uneven distribution of geographical topics. An advantage of our approach is to extract the geographical feature of the target area is obtained by data-driven analysis, which is not possible in an ad hoc approach to use multiple indices as in [
5], [
6] or [
9].
The above discussion indicates that we can set the following three conditions for acceptance of the hypothesis in this study such that a geographical topic is named for the density characteristics (density), that geographical topic can indicates CBD area (centrality) and that the spatial distribution of topics is gradually changing from CBD to suburban areas (Continuity). For simplicity, other features are not considered in this study.
3. Data
The input information to the topic model is Bag-of-Words (BOW), which counts the number of vocabulary for each document. In case of geographical data, attributes are count data such as population, size of area, etc. The difference in data characteristics between the original document-vocabulary counts and our mesh-attribute records is the range of counting. The former ranges are lower in positive number, but the latter ranges are much wider. In order to convert the mesh-attribute records into BOW format, this study proposes to discretize a distribution by using multiple dummy variables for each attribute. Such data processing is called binning. By applying binning to geographical data, continuous distribution of an attribute of geographic information can be converted to BOW format.
At the estimation of the topic model, number of topics K has to be given. To fix an appropriate number of topics requires iterative estimations of the model. In this study, the following two step procedures are proposed. Firstly, (local) summit of log-likelihood is searched. Then, the similarity of topics in terms of topic-vocabulary matrix is calculated by using cosine similarity. Referring to the determined threshold of the similarity, some of the similar topics are merged. Topic load matrix is used to check the transition of topics between two cross-sections.
The dataset used in this study is the Kyusyuu and Fukuoka metropolitan areas, recorded in 2000 and 2010. We can set a much later period for study, but any period is possible for testing the applicability of the geographical model. Another reason for setting the period 2000–2010 is that the demographic trends in Fukuoka and Kitakyushu are more contrasting in this period than in any other. The population of Fukuoka metropolitan area in 2000 and in 2010 are about 2.337 million and 2.513 million, respectively, while the population of Kitakyusyu metropolitan area in 2000 and in 2010 are about 1.481 million and 1.425, respectively (Census in Japan in 2000/2010). In the Kitakyusyu metropolitan area, a natural decrease of population was started in 2000 and social decrease of the population continued after 1985. In the Fukuoka metropolitan area, a natural increase and social increase of population was observed between 2000 to 2010. The sum of urbanized area is about 2000 squared kilometers. Thirty-four attributes are collected form the national census, economic census and office and company statistics, the attributes of which are shown in
Figure 3. The number of meshes in the target area is totally 2113 for each cross section, by dropping the unused.
In order to estimate the common geographical topics in two cross sections, the datasets in 2000 and 2010 are stacked, so then totally 4226 samples are inputted. Looking on the attribute distribution, most of the attributes are skewed due to the missing observation. In order to consider the information brought by them, zero or missing observation is set to be an independent class for all attributes. Each distribution was binned by natural classification. In this study, seven classes are set by several trials. Natural classification gives a set of thresholds based on the second derivative of a distribution (of each attribute). Other characteristics of the dataset is “NA: No Answer”, which is given to the mesh in which observations are one or a few, in order to mask the characteristics of a very small number of targets in a mesh. Including zero and NA, a total of nine classes are set for each attribute. Since all the attribute observation in each mesh is classified into one of nine classes through the binning of attributes, all the mesh has 34 words being equal to the number of attributes. As a result, V is 34 * 9 = 306, D is 4226, is 34 and N is 4226 * 34 = 143,484.
5. Conclusions
This study made the detailed land use analysis by applying the model to two metropolitan areas in Japan, originally developed by Tsukai and Tsukano (2018) [
19]. The empirical analysis in Kitakyusyu city and Fukuoka city showed the applicability and validity of the proposed model. The information obtained from the topic loading matrix can be utilized in a variety of ways, including dominant plot mapping, weighted aggregation by topics with other related attributes, and the composite topic aggregation along with the accessibility to public transportation. The estimated geographical topics fulfilled the three conditions for appropriateness of geographical topic model, discussed in 2.4. Therefore, we can conclude that the model is highly capable of capturing land use characteristics. The advantage of proposed model is to give the detailed information in land use types with their locations. Based on the outputs of the model, the characteristics in land use change in the cities were quantitatively clarified. The combined use of the output of the model with other geographical information is useful to find appropriate policies in land use.
Further analysis will make it possible to find a consistent land use policy of a city. For example, if the compactness of land use is clarified in the city center and around the road-side, a comprehensive policy for controlling core, suburban and fringe areas in a city can be discussed. From the viewpoint of usefulness in the observation of sprawl phenomenon, remaining five indices (concentration, clustering, nuclearity, mixed-use and proximity) should be measured and tested of their performance on sprawling, based on the outputs of proposed model. In terms of model development, it is desirable to increase the classification of land use patterns in urban centers in order to increase their empirical usefulness. For this purpose, a novel pre-processing of input data should be developed. For example, binning of continuous attributes (how to discretize the geographic attributes) can be improved. By comparing the performance of the models using different methods of binning attributes as a preprocessing of the input data, the limitations of the current model and the areas for improvement will become apparent.