1. Introduction
In recent years, due to the proactive disclosure of data by government agencies and private companies, detailed spatial data with high spatial resolution have become available, allowing us to quantitatively understand the reality of socioeconomic activities in greater detail.
Geospatial data commonly have a property known as spatial autocorrelation, also known as the ‘first law of geography’ [
1], where data from locations closer to each other have a stronger correlation. One approach for the analysis of geospatial data with spatial autocorrelation is to assume that the data generation process is common regardless of location and to build models that represent the spatial autocorrelation of dependent variables, disturbances, etc. The spatial regression models in the field of spatial econometrics (e.g., [
2]), the universal kriging model in the field of spatial statistics (e.g., [
3]), and the eigenvector spatial filtering (ESF) model in the field of quantitative geography [
4,
5] are examples of this approach.
Another approach is to assume the existence of spatial heterogeneity, which means that the data generation process differs by location. This study focuses on this approach, which can be broadly divided into two types depending on assumptions about the structure of spatial heterogeneity [
6]. One assumption is that the influence of the data formation factors varies continuously with respect to spatial location. The competing assumption is that the influence of the data formation factors differs discontinuously at certain spatial boundaries.
Models which use the first assumption, that data formation factors vary continuously, are known as spatially varying coefficient (SVC) models, and the most applied model is geographic weighted regression (GWR) [
6,
7,
8]. A weighted least-squares method, which uses a distance decay function to give large weights to geospatial data in the vicinity of an analysis target point, is used to obtain point-specific estimates of coefficients that vary smoothly in space. The eigenvector spatial filtering-based spatially varying coefficient (ESF-SVC) model [
9,
10,
11] is an extension of ESF [
4,
5], which represents the spatial correlation structure using the eigenvectors of the spatial weight matrix and estimates the coefficients that vary continuously in space. ESF-SVC is a highly useful analysis method with several advantages over GWR, such as the ability to represent the structure of spatial heterogeneity more flexibly, easier estimation, and improved applicability to large scale data [
11]. It has been applied to various regional analyses [
12,
13,
14,
15].
Models which use the second assumption, that data formation factors vary discretely at certain spatial boundaries, have been used for the analysis of the geographical segmentations of the real estate market in regional sciences and the detection of point event clusters, such as geographical clusters of infectious disease outbreaks in epidemiology and criminal concentration areas in criminology. In the analysis of geographic real estate market segmentations, the scale of spatial heterogeneity in property valuations is analyzed by comparing models with different geographic tessellation, such as school districts and neighborhoods (e.g., [
16,
17]). In the detection of point event clusters, spatial scan statistics [
18], which are representative methods developed in the field of spatial epidemiology, and agglomeration analysis [
19,
20], which apply the false discovery rate controlling method [
21], have been used to determine whether the frequency of point events in a particular region differs from that of other regions, according to predetermined regional divisions. As both approaches predefine regional divisions, they have limitations in analyzing a series of multiple regions with spatial heterogeneity. To solve the above-mentioned limitations, the generalized lasso (GL) [
22] has been applied for analyses of discrete spatial heterogeneity [
23,
24,
25,
26,
27,
28,
29]. The GL is an extension of the lasso [
30] by introducing
ℓ1 regularization to the difference between the coefficients of adjacent regions. Using a model in which each region has a coefficient that represents the difference from the overall trend, a series of regions with the same level of spatial heterogeneity can be extracted by regularizing the difference between the coefficients of neighboring regions.
The above-mentioned approaches use different assumptions on spatial heterogeneity; the former assumes the heterogeneity that varies continuously in space, while the latter assumes the heterogeneity that varies discretely at specific geographical borders. However, which assumption is adequate to represent the spatial heterogeneity of geospatial data is still unclear.
Using the real estate market analysis as an example, those who prefer to live in the city center and those who prefer to live in the suburbs evaluate real estate properties differently; the former value proximity to urban services and the latter value proximity to natural environment and broad space; the coefficients of the explanatory variables expressing proximity to those services differ by location. If there are no clear boundaries between urban and suburban areas, as is the case in Japan’s metropolitan areas, it is reasonable to assume that coefficients are continuously changing in space. In addition, if there are specific neighborhoods recognized as high-end residential areas, the valuation may differ from the surrounding areas and real estate prices will show discontinuity at the border of those neighborhoods. Considering the above, there might exist both continuous and discrete spatial heterogeneity in real estate price data. However, previous studies have not sufficiently discussed how to consider these two different types of spatial heterogeneity simultaneously and how to detect them separately, which may lead to biased estimates and the wrong interpretation of geospatial phenomena.
This paper proposes a new approach for the analysis of geospatial data with both continuous and discrete spatial heterogeneity by fusing ESF-SVC and GL and evaluates its performance to verify its effectiveness through the application of geospatial data analysis.
The remainder of this paper is organized as follows. In
Section 2, after the ESF-SVC and GL are outlined, the fusion model, the eigenvector spatial filtering, and generalized lasso-based spatially varying coefficient model (ESF-GL-SVC) are presented and are evaluated using simulation data. In
Section 3, we apply the ESF-GL-SVC, ESF-SVC, and GL models to the residential rent data in the southwestern part of the Tokyo metropolitan area, interpret the estimated results, and discuss the effectiveness of the ESF-GL-SVC model. Finally,
Section 4 concludes the study.
2. ESF-GL-SVC: A Model to Analyze Continuous and Discrete Spatial Heterogeneity
We first explain two previous models that deal with spatial heterogeneity, introduce the ESF-GL-SVC, which is a fusion of the two, and then represent the performance evaluation.
2.1. Previous Models for Spatial Heterogeneity
2.1.1. Eigenvector Spatial Filtering-Based Spatially Varying Coefficient (ESF-SVC) Model
The ESF-SVC model utilizes the common statistical test for spatial autocorrelation, Moran’s I, to express the spatial heterogeneity of coefficients. Here we consider the analysis that the geospatial data at location i, yi, is regressed on the attributes of the K types of its attributes xij. Let N denote the number of locations, y denote an N by 1 vector of dependent variables, X denote an N by K matrix of explanatory variables whose first column is an N by 1 vector of ones. β is a K by 1 coefficient vector of the regression of y on X and the first element is an intercept.
Let
C denote an
N by
N spatial proximity matrix between
N locations,
1 denote an
N by 1 vector of ones,
I denote an
N by
N identity matrix, and
M denote an
N by
N cantering matrix for
N by 1 vector, which is
M = (
I −
1 1′/
N). Then the Moran’s I statistics
MC of dependent variables
y observed at
N locations is:
The term MCM represents the autocorrelation structure of y that have spatial proximity expressed by C.
Griffith [
4] proposed an ESF, which utilizes the eigenvectors of matrix
MCM as explanatory variables in linear regression models to express the spatial correlation of dependent variables and remove the correlation from disturbances. The eigenvectors that correspond to large eigenvalues represent the continuous global spatial correlation patterns and the eigenvectors with small and positive eigenvalues represent the continuous local spatial correlation patterns.
Let
E denote an
N by
M matrix, which consists of the first
M eigenvectors, whose eigenvalues are the largest, of matrix
MCM and whose
i-th column is a
i-th eigenvector
ei,
γ denote an
M by 1 coefficient vector for the explanatory matrix
E. The basic linear model of ESF is given by:
where
. It is assumed that the disturbance
ε have homoscedasticity and no correlation.
Griffith [
9] expanded ESF to an ESF-SVC model to estimate spatially varying coefficients that represent the heterogeneity of coefficients of explanatory variables. The location-based coefficient for explanatory variables
xk and
is modelled by:
where
represents the coefficients that vary by location, and the ESF-SVC model is:
where
denotes the Hadamard product and
. It assumes that the spatial autocorrelation of dependent variables is represented by the heterogeneity of
and there is no spatial autocorrelation left on disturbances.
The ESF-SVC model that utilized many eigenvectors might cause the overfitting problem. One solution is to utilize only eigenvectors whose Moran’s I statistics are larger than one-fourth of Moran’s I statistics of the first eigenvector [
31]. It turns out to be only eigenvectors that correspond to large eigenvalues and represent continuous global spatial heterogeneity are utilized to express the spatial heterogeneity of spatially varying coefficients, then their continuous local spatial heterogeneity cannot be considered in the model.
The ESF-SVC model is used in this study because it can be described as a linear regression model and has high compatibility with GL, which performs ℓ1 regularization for a linear regression model.
2.1.2. Generalized Lasso (GL)
The lasso (least absolute shrinkage and selection operator) [
30] is a most common sparse modelling method that adds
ℓ1 penalty terms to the objective function of estimation to obtain the sparse estimates of coefficients. It is often utilized in the variable selection. The estimation of the linear regression model by lasso is given by:
where
λ denotes a weight for the penalty term.
The GL [
22] is an expansion of lasso that adds
ℓ1 penalty term on the differences between ‘adjacent’ coefficients in addition to the
ℓ1 penalty term on the coefficients themselves. It can detect change points when it is applied to time-series analyses and borders when it is applied to spatial analyses. The estimation of the linear regression model with GL is represented by:
where
λ and
δ are hyperparameters of weights on penalty terms and
C is a set of pairs of adjacent coefficients.
The GL has been applied to detect discrete spatial heterogeneity. It is applied not only to linear regression analyses [
22,
23,
24] but also to many types of analyses, such as the estimation of the region-specific spatial covariance function [
25], the estimation of the spatial and temporal quantile function [
26], and the spatial cluster detection of point events based on Poisson regression models [
27,
28,
29].
2.2. ESF-GL-SVC
We propose an ESF-GL-SVC model to analyze continuous and discrete spatial heterogeneity simultaneously by combining the ESF-SVC and GL.
Continuous heterogeneity is represented by coefficients of ESF-SVC. The ESF-GL-SVC model utilizes the eigenvectors whose Moran’s I statistics are larger than one-fourth of Moran’s I statistics of the first eigenvector [
31]. Discrete heterogeneity is represented by coefficients of dummy variables
α that are set for all subregions in the study area. Let
D denote the subregion dummy variable matrix; the element
Dir is one if point
i is in subregion
r and zero otherwise. Then, the ESF-GL-SVC model is expressed as:
The estimation of the ESF-GL-SVC model with the GL regularization term can be represented by Equation (8), where
C denotes the set of combinations of adjacent subregions, and
λ,
δ1, and
δ2 are the hyperparameters.
The second term, the ℓ1 penalty term on the differences between coefficients of adjacent subregions, enables the extraction of a series of subregions with the same level of spatial discrete heterogeneity; it can mitigate the scale issues associated with the preset segmentation of subregions. The third term, the ℓ1 penalty term on the coefficients of subregion dummy variables, and the fourth term, the ℓ1 penalty term on the ESF-SVC coefficients, have the effect of inducing a sparse solution. When data have only continuous spatial heterogeneity, the third term functions to lead to an estimation result where the values of the coefficients of subregion dummies are zero, and when data have only discrete spatial heterogeneity, the fourth term functions to lead to an estimation result where the values of the ESF-SVC coefficients are zero. The regularization for the coefficients themselves can reduce the occurrence of parameter identification problems between the ESF-SVC and the subregion dummy variable coefficients.
Three hyperparameters should be selected according to the fitness of the models. The Bayesian information criterion (BIC) might be an option. However, since the estimated coefficients would be biased due to the regularization terms of lasso [
32], the model selection utilized by the estimates of Equation (8) is not appropriate. Therefore, we proposed the estimation of coefficients first by Equation (8), building a model that corresponds to the estimated results by removing coefficients estimated as zero and setting a common dummy subregion coefficient if adjacent subregions have the same estimates, estimate the model without regularization terms, and evaluating the results by BIC. These procedures avoid biased estimates, and the fitness of model would be evaluated appropriately. The estimation of the ESF-GL-SVC model is executed by the ‘genlasso’ package [
33] in R. This package estimates coefficients by gradually varying hyperparameter
λ under a fixed ratio of weights
δ. This study set both
δ1 and
δ2 in the estimation of fusion model and
δ for the estimation of GL as {0.1, 1, and 10} and searched for the estimation result with the minimum BIC value. The BIC-based evaluation may also be useful in reducing the possibility of parameter identification problems by selecting models with fewer non-zero coefficients.
2.3. Performance Evaluation by Simulation Experiments
We evaluated the performance of ESF-GL-SVC applied to simulated data with continuous and discrete spatial heterogeneity. The settings for the simulation data generation are outlined below.
We set a square with side length 1 as a study area and generated a predetermined number of points in it. The coordinates of points are generated by the uniform distribution between 0 and 1. Simulated data were generated from:
where
x is a vector of explanatory variables whose elements are generated from the uniform distribution between 0 and 1.
To generate the data values with continuous spatial heterogeneity at each point, we set a spatial proximity matrix
C by the gaussian kernel whose width is 0.2, obtained eigenvectors of the matrix
MCM by the approximate calculation by [
34], selected eigenvectors whose eigenvalues were larger than one-fourth of the largest eigenvalues according to [
31], and generated simulation data at each point by setting coefficients
βs and
γs. To simulate spatially varying coefficients with different structures of spatial heterogeneity for each trial, one eigenvector was randomly selected for each explanatory variable. The coefficient
γ of selected eigenvectors was set to one and other coefficients were set to zero. The non-spatially varying coefficients,
β1 and
β2, were set to one.
To generate the data value with discrete spatial heterogeneity, we divided the study area into 10 by 10 square subregions with side length 0.1 and assigned a dummy variable to each subregion. Let α denote a vector of subregion dummy variables and D denote a matrix to assign subregion dummy variables to each point. The coefficients of dummy variables of the four subregions in the center of study area in fifth and sixth rows and fifth and sixth columns were set to one and the other coefficients were set to zero.
Then the simulated value at each point was generated adding disturbances independent and identically distributed according to a normal distribution.
Table 1 summarizes the simulation data generation settings for the following three experiments. The first two experiments evaluate the performance of ESF-GL-SVC by changing the amount of data and the size of variance of disturbances. The last experiment compares the performance of ESF-GL-SVC with those of ESF-SVC and GL. Simulation data are generated 1000 times for all experiments. The codes for the simulation experiments are available in the
Supplementary Materials.
2.3.1. Effect of Amount of Data on Performance of ESF-GL-SVC
Five different settings were set for the total number of points to check the effect on model estimation. When generating data whose average numbers of points in each subregion were set to two, five, and ten, the numbers of points in each subregion were controlled to be the same and the position of points were randomly set inside each subregion. When generating the data by other settings, points were randomly distributed in the whole area and then the number of points in each subregion was not the same.
Figure 1 shows the relationship between amount of data and Moran’s I statistics of residuals, and
Figure 2 and
Figure 3 show the root mean square error (RMSE) between simulated and estimated coefficients and the average of estimation variances of coefficients of the intercept
and the explanatory variable
, respectively.
The precision of estimation increases as the number of points increases. When the numbers of points in each subregion are two and five, the RMSEs between simulated and estimated coefficients and estimation variances are quite large; however, when the number of points in each subregion exceeded 10, it was confirmed that the estimation precision was high and the spatial correlation in residuals were removed. The experiment suggests that the estimated results are stabilized when the number of points in each subregion is larger than 10.
2.3.2. Effect of Magnitude of Variances of Disturbances on Performance of ESF-GL-SVC
Five different settings for the standard deviations of disturbances were tested to determine their effect on model estimation.
Figure 4 and
Figure 5 show the RMSEs between the simulated and estimated coefficients and the estimation variances of coefficients of the intercept
and the explanatory variables
, respectively.
When the standard deviations of disturbances were set to 0.8 and 1.2, the spatially varying coefficients were not estimated properly. The RMSEs between simulated and estimated coefficients and the estimation variances of coefficients were large at these settings. Considering that the values of , the simulated values before adding disturbances, had a standard deviation of 0.710 in the average of simulations, it is confirmed that the spatial heterogeneity of coefficients cannot be specified if the standard deviation of disturbances exceeds that of the effect of explained variables.
2.3.3. Comparison with Previous Models
We estimated the coefficients of ESF-GL-SVC, ESF-SVC, and GL models and evaluated their performances by the BIC calculated by the model without regularization terms, Moran’s I statistics of residuals, and the RMSEs between simulated and estimated coefficients of each location. The plot of spatial distribution of estimated coefficients were also considered for the evaluation.
The distributions of BIC and Moran’s I statistics of residuals for data with continuous and discrete spatial heterogeneity are shown in
Figure 6, the distribution of RMSEs between simulated and estimated coefficients and the average of estimation variances of coefficients and are shown in
Figure 7 and
Figure 8, respectively, and the spatial distributions of simulated and estimated coefficients by three models at one simulation are shown in
Figure 9.
The distribution of BICs and Moran’s I statistics of residuals reveals that the ESF-GL-SVC outputs the results with the smallest BIC, removing the spatial autocorrelation from the residuals. The differences between the estimators by the ESF-GL-SVC and the simulated coefficients are very small, representing the continuous heterogeneity by the estimates of spatially varying coefficients and the discrete heterogeneity by the estimates of subregion dummy coefficients. On the other hand, the ESF-SVC and GL models failed to remove the spatial autocorrelation from the residuals, and the estimators were different from the simulated coefficients with large variances.
Figure 9 indicates that the ESF-SVC represents the discrete heterogeneity by the continuous spatially varying coefficients and the GL represents the continuous heterogeneity by the discrete subregion-based coefficients. Even though the coefficients of subregion dummy variables of the proposed model could potentially be used to represent continuous spatial heterogeneity, only discrete heterogeneity is extracted in this estimation by them. This may indicate that the regularization for the coefficients of the subregion dummy variables and the model selection based on BIC are effective in avoiding parameter identification problems between continuous and discrete spatial heterogeneity. If both continuous and discrete spatial heterogeneity exists, they would fail to extract the structure of spatial heterogeneity.
The experiments confirmed that the ESF-GL-SVC model showed better performance than the others for the data with continuous and discrete spatial heterogeneity.
3. Application to Apartment Rent Data
3.1. Data and Models
We applied the ESF-GL-SVC, ESF-SVC, and GL models to the apartment rent data in the Shibuya, Setagaya, and Meguro wards in the southwestern part of the Tokyo Metropolitan area in 2017. The data was collected by ‘At Home Co., Ltd.’; the company publishes real estate price information.
High-rise condominiums whose number of floors exceed 15 were excluded as their rents have a different pricing structure from the pricing of other apartments. If the rent data from one building make up most data for that neighborhood, it might cause difficulties in estimating coefficients because the most explanatory variables have the same or similar values. Thus, to avoid such issues, only one apartment was randomly selected from each floor of each building. Consequently, the total number of records was 13,748.
The dependent variable is the logarithm of rent per square meter in Japanese Yen (JPY), and explanatory variables are the logarithm of ‘floor level,’ ‘building age,’ ‘property size,’ ‘walking time to the nearest train station’ (hereafter, ‘time to train service’), and ‘average of travel time by train service to five major stations located in central business districts (CBD)’ (hereafter, ‘time to CBD’). As there are properties whose building ages are zero, we added one when we calculated the logarithm of building age. The selected major stations were Shinjuku, Ikebukuro, Shibuya, Tokyo, and Shinagawa, which serve the largest numbers of passengers in the Tokyo Metropolitan area. We surveyed the travel time that includes the half of train headway to five major stations when leaving the station at noon on weekday on the public transport route planning service of Yahoo! Transit and calculated the weighted average by the numbers of passengers at major stations.
Table 2 summarizes the descriptive statistics of dependent and explanatory variables. Since the scales of explanatory variables affect the regularization terms on the estimation of each coefficient, the explanatory variables were standardized to zero mean and unit variance.
To represent continuous heterogeneity by spatially varying coefficients in the ESF-GL-SVC and ESF-SVC models, the elements of spatial proximity matrix C were set by inputting distances between properties into the gaussian kernel function with the band at 500 m intervals from 1 km to 5 km. For each band setting, the eigenvectors of MCM, whose eigenvalues are larger than the one-fourth of the largest eigenvalue, were selected as the explanatory variables in the two models. In this application, we selected the band of 4.5 km for the ESF-GL-SVC and 2 km for the ESF-SVC that output the smallest BIC by each model. In the GL estimation, ward-based coefficients for explanatory variables were estimated to represent the heterogeneity of valuation in global scale.
To represent discrete spatial heterogeneity by subregion dummies in the ESF-GL-SVC and GL models, the study region was divided into 445 neighborhoods (cho and cho-me in Japanese) and at least one real estate property existed in 431 neighborhoods. A subregion dummy was set for each neighborhood except the reference neighborhood, Hachimanyama 3 cho-me. The selected reference neighborhood is the neighborhood where the root mean square of residuals by ESF-SVC is the minimum. The adjacency of the coefficients of subregion dummies is set when neighborhood polygons share borders.
This analysis used the same search process of the best hyperparameter setting for each model as the simulation experiments;
δ1 and
δ2 in the ESF-GL-SVC model and
δ in the GL are set to {0.1, 1, 10}, and the estimation was performed by changing
λ using the R package of ‘genlasso’. The results with the
ℓ1 regularization were used only to select variables whose estimators are non-zero; then, BICs were calculated via OLS estimation of models with selected variables. The code and sample data for the analysis are available in the
Supplementary Materials.
3.2. Estimation Results
Table 3 summarizes the estimated results of the three models. First, all three models effectively consider the spatial autocorrelation of dependent variables, as Moran’s I statistics of residuals indicate that residuals do not have spatial autocorrelation. Second, it is confirmed that the ESF-GL-SVC model shows the best performance as the coefficients of determination is the maximum and the BIC is the minimum.
The estimated intercepts and coefficients of explanatory variables by the ESF-GL-SVC and ESF-SVC models are summarized in
Table 4 and
Table 5, respectively, and the estimated ward-based coefficients by GL is summarized in
Table 6. The spatial distribution of intercepts is shown in
Figure 10, the spatial distributions of the subregion coefficients by the ESF-GL-SVC and GL model are shown in
Figure 11, and the spatial distributions of coefficients of explanatory variables are shown in
Figure 12,
Figure 13,
Figure 14,
Figure 15 and
Figure 16.
3.3. Discussion through Comparison of Results by Three Models
3.3.1. Intercepts and Coefficients of Subregion Dummies
First, we focused on the intercepts by the ESF-GL-SVC and ESF-SVC models (
Figure 10) and subregion dummies using the ESF-GL-SVC model (
Figure 11a). The sum of estimated results of intercepts and subregion dummies by the ESF-GL-SVC model and the estimated results of intercepts by ESF-SVC model have higher values in the eastern part of the study area, which is closer to the CBD, indicating that the rents are expensive in the area. However, the maximum intercept shown in
Table 4 is smaller than that in
Table 5, as the discrete spatial heterogeneity is represented by the coefficients of the subregion dummies by the ESF-GL-SVC model.
The south of Meguro ward is a well-known high class residential area; the ESF-GL-SVC model extracted the discrete spatial heterogeneity represented by positive subregion coefficients and the ESF-SVC model extracted the continuous spatial heterogeneity by larger values of spatially varying intercept. In the next section, we will comment on which identification of spatial heterogeneity is appropriate with the results of the estimation of the coefficients of the explanatory variables.
Second, we compared the estimates by the GL with those by the two other models. The estimated intercept by GL is close to the averages of the coefficients by the ESF-GL-SVC and ESF-SVC models. The number of subregions with non-zero coefficients by the GL is larger than that by the ESF-GL-SVC model; the GL model represents the spatial heterogeneity of rent by the subregion-based heterogeneity (
Figure 11b). The result of GL indicates similar spatial patterns to those of the ESF-GL-SVC and ESF-SVC models; however, the GL model has higher BIC and lower adjusted coefficient of determination than the ESF-GL-SVC model; it seems that the discrete spatial heterogeneity only is not appropriate for the spatial heterogeneity representation of rent.
3.3.2. Coefficients of Explanatory Variables
First, it is confirmed that all three models can distinguish whether the coefficients of the explanatory variables have spatial heterogeneity. The spatial patterns of estimates reveal that the coefficients of ‘floor level’ and ‘building age’ have weak heterogeneity, but the coefficients of ‘property size’, ‘time to train service’, and ‘time to CBD’ have strong heterogeneity.
Second, the estimates by the ESF-GL-SVC and ESF-SVC models are similar, except for the coefficients of ‘time to CBD’. The estimates for ‘property size’ and ‘time to train service’ are almost the same for the ESF-GL-SVC and ESF-SVC models, and the estimates for the GL also show similar spatial patterns, although the spatial resolution of the ward-based coefficients is quite low.
If the neighborhood-based coefficients are set for all explanatory variables in the GL model, it might be possible to represent the spatial heterogeneity of coefficients more precisely; however, as the order of calculation of GL is estimated as
O(
mn2 +
Tm2), where
m is the number of constraints,
n is the number of explanatory variables, and
T is the number of iteration and should be larger or equal to
m [
22], the feasibility of estimation of high resolution model with large
m and
n is low. Thus, the ESF-GL-SVC model has advantages over the GL model in utilizing the spatially varying coefficient model to express the continuous spatial heterogeneity.
Third, the coefficients of ‘time to CBD’ reveal the advantage of ESF-GL-SVC model over ESF-SVC. It is natural that the coefficients of ‘time to CBD’ are negative, but the estimates by ESF-SVC have both positive and negative values, especially indicating large positive values in the northwest and southwest corner of the study area. On the other hand, the estimates by the ESF-GL-SVC model are all negative using the same eigenvector setting.
This indicates that ESF-SVC overfits to the dataset. It is caused by the utilization of a shorter band for spatial proximity matrix
C and more eigenvectors that represent local spatial heterogeneity patterns in the ESF-SVC model. The cause of overfit is likely because ‘time to CBD’ has strong spatial autocorrelation on a global scale, although other explanatory variables do not. By multiplying the coefficients and explanatory variables, both of which have spatial autocorrelation, it is possible to express the spatial autocorrelation of dependent variables. The rents in the south of Meguro ward have strong local and discrete spatial heterogeneity, represented by the estimated coefficients of subregion dummies shown in
Figure 11. The ESF-SVC model tried to express the local heterogeneity by the intercept (
Figure 10b) and the coefficient of ‘time to CBD’ (
Figure 16b); as a consequence, the coefficients of ‘time to CBD’ in the surrounding area were highly variable and difficult to interpret. This misspecification by ESF-SVC is avoidable if the only longer band is set for spatial proximity matrix
C; however, when the longer band was set, the eigenvectors represent only global spatial patterns, and the ESF-SVC model would lose explanatory power for the local heterogeneity.
3.4. Summary of Application to Apartment Rent Data Analysis
The application to apartment rent data reveals that the ESF-GL-SVC model outperforms the two previous models; it has advantages in extracting both continuous and discrete spatial heterogeneity, which apartment rent data have.
The ESF-GL-SVC model was able to estimate the spatially varying coefficients that are interpretable, although the ESF-SVC model was not. The analysis by the ESF-GL-SVC model clarifies the structure of spatial heterogeneity of the effect of explanatory variables. The coefficients of property size have the largest spatial heterogeneity; the size does not affect the rent per square meter near the CBD but decreases in suburban areas. It also depicts the existence of discrete spatial heterogeneity by neighborhood-based coefficients.
4. Discussion
This study proposed an analysis to extract both continuous and discrete spatial heterogeneity by ESF-GL-SVC combining ESF-SVC and GL. Through the analysis of simulated data and apartment rent data, it is confirmed that the ESF-GL-SVC model can separate the continuous and discrete spatial heterogeneity of dataset. It is possible to avoid the overfitting the dataset and output the interpretable estimates.
There are three ways to improve the ESF-GL-SVC model. The first improvement would be the mitigation of the effect caused by biased estimates of lasso. It was pointed out that the lasso estimator is biased toward zero and does not have an oracle property, which consists of the consistency in variable selection and the asymptotic normality [
32]. The minimax concave penalty (MCP) was proposed to mitigate the bias [
35] and was expanded to the fused MCP [
36], which can reduce the bias of GL. We analyzed the geographic market segmentation of apartment rent [
37] and the extraction of spatio-temporal changes in real estate market prices [
38] by fused-MCP and confirmed that the fused-MCP outputs better estimates than the GL. However, there is an issue of computational complexity. The construction of a fusion model of ESF-SVC and fused-MCP with efficient estimation is an effective extension. The second improvement would be the mitigation of the overfitting problem of ESF-SVC. Murakami et al. [
10] proposed a random effects ESF-SVC (RE-ESF-SVC) model based on the random effects specification of ESF [
34], and Murakami et al. [
39] confirmed that RE-ESF-SVC is one of the models that can estimate the structure of spatially varying coefficients accurately and is the most computationally efficient. The application of RE-ESF-SVC would improve the estimation of spatially varying coefficients; however, as the estimation of RE-ESF requires the restricted maximum likelihood method, the introduction of regularization with the
ℓ1 norm or MCP functions might be challenging. The third improvement would be to enhance the capability of the method to analyze spatial heterogeneity at various scales. The proposed method is structured to analyze phenomena consisting of global and continuous spatial heterogeneity and local and discrete spatial heterogeneity. There might be local and continuous spatial heterogeneity, such as the impact of a small park on the neighborhood environment, and global and discrete spatial heterogeneity, such as the impact of regional boundaries in interconnected urban areas. RE-ESF-SVC and multi-scale GWR [
40] have been proposed as methods to consider various scales in continuous spatial heterogeneity analysis, and group lasso [
41] and tree structured group lasso [
42] are expected to consider multiple scales of discrete spatial heterogeneity. Considering multiscale heterogeneity for both continuous and discrete spatial heterogeneity is an important development direction for this research.