A CUDA-Based Parallel Geographically Weighted Regression for Large-Scale Geographic Data

Wang, Dongchao; Yang, Yi; Qiu, Agen; Kang, Xiaochen; Han, Jiakuan; Chai, Zhengyuan

doi:10.3390/ijgi9110653



by Dongchao Wang ¹, Yi Yang ¹,*, Agen Qiu ², Xiaochen Kang ², Jiakuan Han ¹ and Zhengyuan Chai ¹

¹ School of Geomatics and Marine Information, Jiangsu Ocean University, Lianyungang 222005, China

² Research Center of Government GIS, Chinese Academy of Surveying and Mapping, Beijing 100039, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2020, 9(11), 653; https://doi.org/10.3390/ijgi9110653

Submission received: 3 September 2020 / Revised: 23 October 2020 / Accepted: 26 October 2020 / Published: 30 October 2020

(This article belongs to the Special Issue Advances in Computational Approaches for Spatial Analysis and Modeling)


Abstract

Geographically weighted regression (GWR) introduces a distance-weighted kernel function to examine the non-stationarity of geographical phenomena and improve on the performance of global regression. However, GWR calibration becomes a computational bottleneck when large volumes of data are processed in a serial computing mode. To address this problem, this paper proposes fast-parallel-GWR (FPGWR), an improved approach based on the compute unified device architecture (CUDA), to efficiently handle the computational demands of performing GWR over millions of data points. FPGWR decomposes the serial process into parallel atomic modules and optimizes memory usage. To verify the computing capability of FPGWR, we designed simulation datasets and performed corresponding tests. We also compared the performance of FPGWR with other GWR software packages using open datasets. The results show that the runtime of FPGWR is negatively correlated with the number of CUDA cores, and that FPGWR is thousands or even tens of thousands of times faster than traditional GWR algorithms. FPGWR provides an effective tool for exploring spatial heterogeneity in large-scale geographic data (geodata).

Keywords: CUDA; GWR; parallel computation; large-scale geodata

1. Introduction

Large-scale geodata currently attracts considerable attention in many research fields, including mobile communication [1], public transportation [2], medical health [3], Earth observation [4], and climate monitoring [5]. To enhance the capability of analyzing massive geodata, geographic knowledge mining is turning to data-driven patterns [6]. Distributed systems and parallel computing are two feasible technologies for massive geodata analysis. A tremendous amount of multisource geodata is stored in distributed spatial index systems [7], enabling people to access records efficiently. Using the advantages of the distributed system Hadoop, Aji et al. (2019) [8] proposed a scalable high-performance spatial data warehousing system (Hadoop-GIS) that can meet the needs of managing and querying massive geodata. Furthermore, based on the MapReduce parallel computing framework and the Hadoop database (HBase) technology, the origin–destination (OD) estimation method [9] can efficiently manage massive bus travel data and directly estimate the origins and destinations of bus passengers' trips. In the parallel computing field, large-scale geodata can be partitioned into multiple pieces using the strategies of multiple instruction multiple data (MIMD) and single instruction multiple data (SIMD). In contrast to SIMD, MIMD handles multiple instructions simultaneously. Several environments can parallelize multiple tasks based on these strategies, such as the message-passing interface (MPI), multi-core CPUs, and many-core shared-memory graphics processing units (GPUs). MPI mainly standardizes the communication protocol of multi-program clusters, a multi-core CPU relies on the computing power of its cores, and a many-core shared-memory GPU benefits from numerous stream processors (SPs). Wilkinson et al. (1999) [10] introduced parallel programming techniques and showed how to solve problems at a greater computational speed than is possible with a single computer. Gong et al. (2013) [11] proposed a parallel approach that leverages the power of multicore systems to cope with the computational complexity of agent-based models (ABMs) and to handle the space-time complexity of a geographic system. Tang et al. (2015) [12] and Zhang et al. (2017) [13] explored the feasibility of using GPUs to carry out massively parallel spatial computing and accelerate spatial point pattern analysis. Sandric et al. (2019) [14] parallelized certain GIS feature operations using their message-passing interface–GIS (MPI-GIS) system, which integrates the advantages of MPI input/output (I/O) and GPUs on a cluster of nodes. Stojanovic et al. (2019) [15] proposed a watershed analysis algorithm, called multiple flow direction (MFD), designed for multicore CPUs or many-core GPUs. Remarkable progress has been achieved in computer hardware and software, which lays a solid foundation for updating geographical research tools. However, a sizable problem remains to be solved: how can existing geographic analysis tools be transformed to accommodate the development of big geodata mining [16]?

Spatial non-stationarity analysis is an important research field of spatial data mining. Brunsdon et al. (1996) [17] proposed the GWR model as an effective tool to explore spatial non-stationarity. GWR introduces the idea of local smoothing to calibrate the regression coefficients and detect spatial non-stationarity in geographic space. Adding the location factor upgrades GWR from an ordinary linear regression (OLR) model to a local regression model. The locally weighted least squares (LWLS) method is used to estimate the parameters point by point, where each weight is a distance kernel function between the regression point and an observation point. The parameter estimates from GWR are both clearly interpretable and statistically verifiable; therefore, GWR has become a major method for studying spatial heterogeneity. Zhang et al. (2020) [18] employed GWR to identify the driving forces of wastewater discharge between provinces in China and discovered that macro industry policy and environmental protection measures were major reasons for its spatial changes. Wu (2020) [19] used GWR to explore the factors that cause spatially and temporally varying distributions of ecological footprints. Yuan et al. (2020) [20] applied GWR to reveal the spatially varying relationships in environmental variables (Pb and Al) and suggested that GWR was more effective than conventional statistical analysis tools. Hong et al. (2020) [21] studied the spatially heterogeneous relationship between price and pricing variables using multiscale geographically weighted regression (MGWR), which overcame the limitations of hedonic pricing model research for sharing economy accommodation. Wu et al. (2020) [22] developed a geographically and temporally neural network weighted regression (GTNNWR) model, extended from the spatiotemporal proximity neural network (STPNN), which not only exhibited better prediction performance but also quantified the distribution of spatiotemporal heterogeneity more accurately.

Parallelization of geographic analysis tools has become an interdisciplinary subject spanning computer science and geography. The package spgwr [23] was developed to implement GWR in the R language. Another R package (GWmodel) [24] optimized this model with a moving window weighting technique and achieved slightly better efficiency than spgwr. The Python-based implementation of MGWR (mgwr [25]) was developed for multiscale analysis, allowing the relationship to vary at a different scale for each coefficient. Li et al. (2019) [26] (a member of the mgwr development team) upgraded it to distributed parallelization within a high-performance computing (HPC) environment, and the new package (FastGWR) achieved satisfactory results. Tran et al. (2016) [27] studied the implementation of large-scale GWR on the in-memory cluster computing framework Spark (Spark-GWR) and determined that executing GWR in parallel on cluster computers is feasible, although ordinary programmers encounter great difficulty in developing and testing in a cluster environment. As a representative model of local regression, GWR incorporates all of the observations (samples) into the loop of the regression sequence. The key to geographic weighting is the calculation of distance weights for each sample, which is costly in terms of runtime and memory. At the same time, the entire process consumes a large amount of computing time because the weight calculation sits inside multilayer loops. For large-scale geodata, GWR goes through two levels of large iteration: the outer iteration performs the point-by-point regression, and the inner iteration performs the matrix calculation between a single sample and the full sample set. Therefore, limited by its data structure and operating mode, GWR is less effective in addressing large-scale geodata. Concurrency methods based on software optimization can improve the efficiency of geographic analysis tools, but a hardware parallel environment provides native support and achieves the best acceleration performance. Both FastGWR and Spark-GWR can divide GWR into several parallel task sets, but the two programs are designed for the CPU architecture and cannot be adapted to the GPU architecture. FPGWR decomposes large-scale GWR into simpler parallelizable computing units using an atomization algorithm and processes them with numerous parallel GPU cores.

In this paper, we develop FPGWR to reduce the computational complexity of the GWR process and enable the application of GWR to millions or even tens of millions of geodata records. By parallelizing large tasks, this technique significantly improves regression efficiency. On the basis of the CUDA framework, atomic subtasks decomposed from large tasks run on a GPU device in parallel. This paper contributes to the prior literature as follows. (1) FPGWR compensates for the deficiencies of GWR in regression computation for large-scale geodata, and FPGWR with separate atomic computing units (atomization) is more efficient than GWR. (2) FPGWR is a powerful model for exploring spatial heterogeneity and incorporating high parallelism into geographic analysis, and it is applicable to studies in various fields, such as economic geography, social science, and public health. (3) The improvement from GWR to FPGWR provides new insights into geospatial computing from both spatial and computational perspectives.

2. GWR Model and Atomization Algorithm

2.1. GWR Review

Before the 1980s, OLR was frequently applied to the analysis of geographical phenomena. The coefficient estimates $\hat{\beta}$, calculated by the ordinary least squares (OLS) estimator, follow the rule of globally optimal unbiased estimation, so the final regression result merely reflects the average level in the study region. Global regression methods are therefore not appropriate for local regression models. Foster et al. (1986) [28] created a spatial adaptive filter (SAF), drawing on varying coefficient modeling, which can automatically describe step-jump and continuous spatial non-stationarity in the coefficients. Based on the local polynomial smoothing technique, Brunsdon et al. (1996) [17] proposed GWR as an analysis tool.

2.1.1. GWR Model

The GWR model extends OLR by introducing the location factor to express the spatial variation of the coefficients:

$$y_i = \beta_0(u_i, v_i) + \sum_{m=1}^{p} \beta_m(u_i, v_i)\, x_{im} + \varepsilon_i, \quad i = 1, 2, \dots, n \tag{1}$$

where $y_i$ is the regression variable (dependent variable) at location $i$, $(u_i, v_i)$ represents the coordinates (usually latitude and longitude) of the $i$th sample point in the study area, $\beta_m(u_i, v_i)$ denotes the $m$th coefficient of the $i$th sample point as a function of $u_i$ and $v_i$, $x_{im}$ is the $m$th predictor variable (independent variable), $\varepsilon_i$ represents the error term, and $n$ is the sample size. The necessary conditions for Equation (1) can be expressed as follows:

$$\varepsilon_i \sim N(0, \sigma^2), \quad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad (i \neq j) \tag{2}$$

For simplicity, Equation (1) is abbreviated as

$$y_i = \sum_{m=0}^{p} \beta_{im}\, x_{im} + \varepsilon_i, \quad i = 1, 2, \dots, n, \quad x_{i0} \equiv 1 \tag{3}$$

To prevent GWR from degenerating into a general linear regression, the condition $\beta_{1m} = \beta_{2m} = \dots = \beta_{nm}$ must not be among the preconditions.

The variables related to GWR can be defined in matrix form. The independent variable matrix $X$ is written as

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \tag{4}$$

2.1.2. Spatial Weight Kernel Function

For each regression point $i$, there are $n$ spatial weights $w_{ij}$, one for every sample point $j$ in the study area ($i = j$ is allowed). In the GWR model, it is usual to arrange them in the weight matrix $W_i$, a diagonal square matrix:

$$W_i = \begin{bmatrix} w_{i1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & w_{in} \end{bmatrix} \tag{5}$$

At present, there are several forms of the weight kernel function $w_{ij}$, of which Bi-Square and Gaussian are the most widely used. The two functions can be expressed as Equations (6) and (7):

$$\text{Bi-Square}: \quad w_{ij} = \begin{cases} \left[1 - \left(\frac{d_{ij}}{bw}\right)^{2}\right]^{2}, & d_{ij} < bw \\ 0, & d_{ij} \geq bw \end{cases} \tag{6}$$

$$\text{Gaussian}: \quad w_{ij} = e^{-\frac{1}{2}\left(\frac{d_{ij}}{bw}\right)^{2}} \tag{7}$$

where $d_{ij}$ represents the distance between sample points $i$ and $j$, and $bw$ denotes the bandwidth parameter, which can be interpreted both as the neighbor threshold and as the distance attenuation factor within the weight kernel function.
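The two kernels of Equations (6) and (7) can be sketched in a few lines of Python. This is only an illustration: the function names `bisquare` and `gaussian` are chosen here and do not come from the paper or from any GWR package.

```python
import math


def bisquare(d, bw):
    """Bi-Square kernel, Equation (6): the weight is zero at and beyond the bandwidth."""
    if d >= bw:
        return 0.0
    return (1.0 - (d / bw) ** 2) ** 2


def gaussian(d, bw):
    """Gaussian kernel, Equation (7): the weight decays smoothly and never reaches zero."""
    return math.exp(-0.5 * (d / bw) ** 2)


if __name__ == "__main__":
    bw = 10.0  # illustrative bandwidth, in the same units as the distances
    for d in (0.0, 5.0, 10.0, 20.0):
        print(d, bisquare(d, bw), gaussian(d, bw))
```

Both kernels return 1 at zero distance. The Bi-Square kernel excludes every sample at or beyond the bandwidth, whereas the Gaussian kernel only down-weights distant samples.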

2.1.3. Model Regression

The regression coefficient estimate $\hat{\beta}_i$ at position $i$ is defined by

$$\hat{\beta}_i = (X^{T} W_i X)^{-1} X^{T} W_i Y \tag{8}$$

The regression value $\hat{Y}_i$ of the regression point $i$ based on $\hat{\beta}_i$ can be estimated from

$$\hat{Y}_i = X_i (X^{T} W_i X)^{-1} X^{T} W_i Y \tag{9}$$

where $X_i$ represents the $i$th row vector of matrix $X$. The hat matrix plays a very important role in the residual analysis of the linear regression model. This study introduces the hat matrix $S$ into GWR. Its $i$th row $S_i$ can be expressed as follows:

$$S_i = X_i (X^{T} W_i X)^{-1} X^{T} W_i \tag{10}$$

The regression result vector $\hat{Y}$ can be represented with the hat matrix $S$:

$$\hat{Y} = S Y = \begin{bmatrix} \hat{Y}_1 \\ \vdots \\ \hat{Y}_n \end{bmatrix} = \begin{bmatrix} S_1 \\ \vdots \\ S_n \end{bmatrix} Y \tag{11}$$

2.1.4. The Criteria of Optimal Bandwidth Selection

The key to finding the optimal bandwidth $bw$ is minimizing the $\mathrm{AIC}_c$ score. Loop (exhaustive) selection and golden-section search are available to obtain the lowest $\mathrm{AIC}_c$ value. Searching for the optimal bandwidth $bw$ is inseparable from the parameter estimation criterion. The criterion $\mathrm{AIC}_c$ [29] was introduced by Brunsdon et al. (2002) [30] to select the optimal bandwidth of the weight function. The specific formula can be expressed as

$$\mathrm{AIC}_c = n \ln(\hat{\sigma}^{2}) + n \ln(2\pi) + n \left[\frac{n + \mathrm{tr}(S)}{n - 2 - \mathrm{tr}(S)}\right] \tag{12}$$

The residual $\varepsilon$ can be calculated from the sample data $Y$ and the regression result $\hat{Y}$:

$$\varepsilon = Y - \hat{Y} \tag{13}$$

The unbiased estimate of the random error variance, $\hat{\sigma}^{2}$, is expressed as

$$\hat{\sigma}^{2} = \frac{\mathrm{RSS}}{n - 2\,\mathrm{tr}(S) + \mathrm{tr}(S^{T} S)} \tag{14}$$

where RSS indicates the sum of squared residuals, $\mathrm{tr}(S)$ is the trace of the hat matrix $S$, and $n - 2\,\mathrm{tr}(S) + \mathrm{tr}(S^{T} S)$ represents the effective degrees of freedom of GWR. In most cases, $\mathrm{tr}(S^{T} S)$ approximately equals $\mathrm{tr}(S)$, that is, $\mathrm{tr}(S^{T} S) \approx \mathrm{tr}(S)$, and Equation (14) can therefore be simplified as

$$\hat{\sigma}^{2} = \frac{\mathrm{RSS}}{n - \mathrm{tr}(S)} \tag{15}$$

2.2. Atomizing the GWR Model

As mentioned above, the regression process of the GWR model involves two fixed steps: optimal bandwidth selection and model diagnosis. Most existing packages that implement the GWR algorithm run in serial mode, which is poorly suited to the regression computation: a computing device with finite computational power is overloaded when the sample becomes too large. The runtime rises with the sample size, following a power or even an exponential relationship [31]. In this paper, we design Algorithm 1 (atomization) to reduce the complexity of the GWR regression calculation.

Algorithm 1. Atomic Process: The Minimum Unit of Algorithms.

Atomic Process: Optimizing bandwidth searching by minimizing the $\mathrm{AIC}_c$ score

Given test bandwidth ($bw$) and atomic process index ($z$)
Calculate $w_{zi}$ for $i = 1, 2, \dots, n$ from Equation (6) or (7) (note that $w_{zz} \equiv 1$)
Loop each $a = 1, 2, \dots, p + 1$:
    Loop each $b = 1, 2, \dots, p + 1$:
        Set $B_{ab} = 0$
        Loop each $i = 1, 2, \dots, n$:
            $B_{ab}$ += $x_{ia} \times w_{zi} \times x_{ib}$
        End loop
    End loop
End loop
Calculate $B^{-1}$
Set $S_z = 0$, $\hat{Y}_z = 0$
Loop each $a = 1, 2, \dots, p + 1$:
    Set temp_x_inv = 0
    Loop each $b = 1, 2, \dots, p + 1$:
        temp_x_inv += $x_{zb} \times B^{-1}_{ba}$
    End loop
    $S_z$ += temp_x_inv $\times\, x_{za} \times w_{zz}$
    Set temp_x_w = 0
    Loop each $i = 1, 2, \dots, n$:
        temp_x_w += $x_{ia} \times w_{zi} \times y_i$
    End loop
    $\hat{Y}_z$ += temp_x_inv $\times$ temp_x_w
End loop
Return $S_z$, $\hat{Y}_z$
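The atomic process can be sketched in plain Python for one regression point. This is a CPU-side illustration under stated assumptions, not the authors' CUDA kernel: it uses the Gaussian kernel of Equation (7), multiplies the weighted column sums by $y_i$ as Equation (9) requires, and inverts the small matrix $B$ with a hand-written Gauss-Jordan routine. The names `atomic_process` and `invert` are hypothetical.

```python
import math


def gaussian(d, bw):
    """Gaussian kernel, Equation (7)."""
    return math.exp(-0.5 * (d / bw) ** 2)


def invert(mat):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(mat)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def atomic_process(z, X, y, coords, bw):
    """Return (S_zz, y_hat_z) for regression point z.

    X holds n rows of p + 1 values, the first value of each row being 1.
    """
    n, k = len(X), len(X[0])
    uz, vz = coords[z]
    # Weights of every sample relative to point z, computed on demand.
    w = [gaussian(math.hypot(u - uz, v - vz), bw) for u, v in coords]
    # B = X^T W_z X, accumulated element by element.
    B = [[sum(X[i][a] * w[i] * X[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    B_inv = invert(B)
    s_zz = 0.0
    y_hat = 0.0
    for a in range(k):
        x_inv = sum(X[z][b] * B_inv[b][a] for b in range(k))    # (X_z B^-1)_a
        s_zz += x_inv * X[z][a] * w[z]                          # diagonal hat element
        xwy = sum(X[i][a] * w[i] * y[i] for i in range(n))      # (X^T W_z Y)_a
        y_hat += x_inv * xwy
    return s_zz, y_hat


if __name__ == "__main__":
    coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    xs = [0.5, 1.5, 1.0, 3.0, 2.0]
    X = [[1.0, x] for x in xs]
    y = [2.0 + 3.0 * x for x in xs]
    print(atomic_process(2, X, y, coords, bw=2.0))
```

Summing the returned `s_zz` values over all points gives the trace of the hat matrix, and the `y_hat` values give the residuals, so the outputs of the independent atomic processes are sufficient to evaluate the bandwidth criterion.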

2.2.1. Intermediate Matrix

To make parallelization valid, we design GWR atomization to decompose the matrix calculation process. The matrix elements needed for a result are extracted on demand, and the result value is obtained via simple algebraic calculations. This saves a large amount of memory and computing resources in the large matrix operations of GWR. The intermediate matrix, which appears in several common models, is an important object of study in GWR atomization.

OLR can be written in the following matrix form:

$$Y = X \beta + \varepsilon \tag{16}$$

On the basis of OLS, the regression coefficient $\hat{\beta}$ is estimated from

$$\hat{\beta} = (X^{T} X)^{-1} X^{T} Y \tag{17}$$

Next, the regression result $\hat{Y}$ of OLR can be expressed as follows:

$$\hat{Y} = X (X^{T} X)^{-1} X^{T} Y \tag{18}$$

Comparing Equations (8), (9), (17), and (18), we find an intermediate matrix $M$ that appears in every regression model whose unbiased estimates are obtained via OLS. It is given by

$$M = (X^{T} W_i X)^{-1} X^{T} W_i \quad \text{or} \quad M = (X^{T} X)^{-1} X^{T} \tag{19}$$

In the point-by-point regression process, computing the intermediate matrix $M$ is unavoidable.

Matrix $X^{T}$ is defined by

$$X^{T} = \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_{11} & x_{21} & \dots & x_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \dots & x_{np} \end{bmatrix} \tag{20}$$

where $p$ is the number of independent variables. Because $W_i$ is diagonal, multiplying $X^{T}$ by $W_i$ simply scales column $j$ of $X^{T}$ by $w_{ij}$. The resulting matrix $A$ can be expressed as follows:

$$A = X^{T} W_i = \begin{bmatrix} 1 \times w_{i1} & 1 \times w_{i2} & \dots & 1 \times w_{in} \\ x_{11} \times w_{i1} & x_{21} \times w_{i2} & \dots & x_{n1} \times w_{in} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} \times w_{i1} & x_{2p} \times w_{i2} & \dots & x_{np} \times w_{in} \end{bmatrix} \tag{21}$$

Similarly, matrix $B$ can be written element by element as

$$B = X^{T} W_i X = \left[ B_{ab} \right]_{(p+1) \times (p+1)}, \quad B_{ab} = \sum_{j=1}^{n} x_{ja} \times w_{ij} \times x_{jb}, \quad a, b = 1, 2, \dots, p + 1 \tag{22}$$

where $x_{ja}$ denotes the element in row $j$ and column $a$ of $X$ (the first column being the constant 1), and $B$ is a square matrix of dimension $p + 1$. In practical applications, $p + 1$ is usually less than 10, so the time spent on inverting $B$ is negligible.

In contrast to matrix decomposition, the conventional regression subprocess of GWR relies on the full weight matrix $W_i$ when calculating matrices $A$ and $B$. The weighting scheme $W_i$ is determined as follows: (a) Obtain the coordinate matrix $UV_{(n \times 2)}$ of all samples, and then transpose it to obtain $UV^{T}_{(2 \times n)}$. (b) Compute the distance matrix $D_{(n \times n)}$ between the coordinate matrix $UV_{(n \times 2)}$ and its transpose $UV^{T}_{(2 \times n)}$. (c) Calculate the weight matrix $W_{(n \times n)}$ of all samples according to Equation (6) or (7); $W_i$ is then the diagonal matrix formed by the elements of the $i$th row of $W$. However, this process needs huge memory space and calculation time when the sample size is enormous. In addition, each subprocess of GWR determines $W_i$ once, which causes high redundancy in memory and runtime. The matrix decomposition approach of Equations (21) and (22) is adopted to decrease memory usage and runtime.
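The memory argument can be illustrated by computing $B$ in two ways: once with the diagonal weight matrix stored explicitly as an $n \times n$ array, and once element by element from a weight vector, as Equations (21) and (22) allow. The function names `gram_explicit` and `gram_on_demand` are hypothetical, and the tiny data in the demo are made up.

```python
def gram_explicit(X, w):
    """B = X^T W X with W materialised as a full n x n diagonal matrix."""
    n, k = len(X), len(X[0])
    W = [[w[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    # A = X^T W is a (p + 1) x n matrix, Equation (21).
    A = [[sum(X[i][a] * W[i][j] for i in range(n)) for j in range(n)]
         for a in range(k)]
    return [[sum(A[a][j] * X[j][b] for j in range(n)) for b in range(k)]
            for a in range(k)]


def gram_on_demand(X, w):
    """The same B, built element by element without storing W or A, Equation (22)."""
    n, k = len(X), len(X[0])
    return [[sum(X[j][a] * w[j] * X[j][b] for j in range(n)) for b in range(k)]
            for a in range(k)]


if __name__ == "__main__":
    X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
    w = [1.0, 0.5, 0.25]
    print(gram_explicit(X, w))
    print(gram_on_demand(X, w))
```

The explicit version stores on the order of $n^2$ numbers for a single regression point, while the on-demand version keeps only the $n$ weights and the $(p + 1) \times (p + 1)$ result.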

2.2.2. Implementation of the Atomization Algorithm

Unlike the large process with full-matrix multiplication, under the atomization algorithm each logically independent subprocess participates in the regression calculation only once. Ensuring that the subprocess is repeatable is the prerequisite for parallelization. To address the problems caused by redundant computing, this study optimizes two aspects: memory and time. The $\mathrm{AIC}_c$ scores and estimates $\hat{Y}$ generated during the bandwidth calibration process are stored using the singleton pattern. Moreover, on-demand computing eliminates the highly repetitive calculations of the large process. Given a test bandwidth $bw$, the detailed steps of the atomization algorithm are implemented as Algorithm 1.

3. CUDA Enabled FPGWR

FPGWR based on CUDA can process massive spatial geodata. The technique is developed to increase the computing speed of GWR. Supported by a large number of SPs, the GPU device is a natural carrier of HPC for parallel computing. Hardware outperforms software in multithread scheduling. Hence, improving GWR on the basis of the CUDA framework is the preferred solution.

3.1. Optimizing the Kernel Function of CUDA

CUDA is a general-purpose parallel computing architecture introduced by the NVIDIA Corporation [32]. In the CUDA framework, parallel tasks are instantiated as independent, controllable threads. Independence means that there are no mutually exclusive signals among the threads: each thread can run concurrently with its sibling threads without depending on them. Controllability means that the behavior of every thread instance is governed by the same set of parameters; different initialization values make the generated instances differ from each other. Because all threads execute the identical computing process, only one thread scheduler is needed to manage them.

Two principles guide the design of the CUDA kernel function to maximize the use of GPU scheduling resources and computing cycles. First, WARP branching in the kernel function should be minimized as much as possible. Second, the CUDA memory type should be chosen flexibly. The specific optimization strategy is shown in Figure 1. Through matrix decomposition, the atomic kernel function avoids process branching, so the computation workload is spread evenly among the threads. Each atomic task is dynamically assigned a unique thread index ($z$). Because the tasks execute in a completely random order, the atomic threads are decoupled from the SPs. To overcome the performance bottleneck caused by frequent access to global memory, FPGWR uses shared memory to store temporary variables.

3.2. Implementing FPGWR Based on CUDA

In this study, we implemented FPGWR in the CUDA framework using the atomization method. FPGWR significantly shortens the total time of large-scale GWR regression and frees the memory occupied by massive spatial matrix data. The FPGWR implementation consists of five steps. Step 1: the program on the Host prepares the GPU device, and a series of initial parameters are set in the constant memory of the GPU. Step 2: the sample data are loaded into the global memory of the GPU, because the volume of geodata is too large to be held in either shared memory or local memory. If the sample data are too large to fit in GPU global memory, an “out of memory (OOM)” error is thrown. Step 3: CUDA loads the instructions compiled from the code of the atomic kernel function, and the scheduler generates individual threads running the same kernel function. Step 4: all threads are assigned to Streaming Multiprocessors (SMs) in units of WARPs. To cope with the enormous number of threads, the GPU activates the flow-shop scheduling mode. Step 5: CUDA returns the regression results from the GPU to the Host, and the GPU device resources are released immediately. The detailed workflow of the FPGWR implementation is shown in Figure 2.

As shown in Figure 2a, the FPGWR algorithm can be divided into four layers: data layer, input layer, working layer, and output layer. The data layer stores the files of original observations. The input layer reads the spatial observation data from hard disk into host memory; the CUDA programming part is also instantiated in this layer. The initialization parameters and observation matrices are passed together into the atomic kernel function, which is then compiled into an executable program. The working layer runs on the NVIDIA GPU. It starts a massive number of task threads, which are managed uniformly by the multithreaded scheduler of the GPU. At the physical level, WARPs are bundled into a queue of batches, and the WARPs in the same batch are executed synchronously. The output layer collects the regression results. Based on the bandwidth indexes, these results are organized into multiple sets of regression matrices ($S$, $\hat{Y}$, and $\hat{\beta}$). Finally, the algorithm finds the optimal result set, which corresponds to the minimum $\mathrm{AIC}_c$ score.

Figure 2b,c illustrate how the core part of FPGWR works at the micro level. The meanings of the initialization parameters ($n$, $p$ and $bws$) and the prototype of the FPGWR_KERNEL function are described in Figure 2b. The detailed process of the FPGWR_KERNEL function is presented in Figure 2c; its steps correspond to those of Algorithm 1. The multithread scheduling depends on the initial BLOCK and GRID settings of the kernel function in CUDA. BLOCK is set as a one-dimensional vector with a constant value of 64, so each BLOCK contains 64 threads. GRID is set as a two-dimensional vector, in which the first dimension is the sample size $n$ divided by 64 (the size of the BLOCK’s first dimension), and the second dimension is the size of the bandwidth array.
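The launch configuration described here can be mimicked in Python to see how threads map to tasks. Two points in this sketch are assumptions rather than statements from the paper: the division of $n$ by 64 is rounded up so that a partial last BLOCK is still covered, and each thread derives its sample index from its block and thread indices in the usual CUDA manner, with the second GRID dimension selecting the test bandwidth. The names `launch_dims` and `thread_task` are hypothetical.

```python
BLOCK_SIZE = 64  # threads per BLOCK, as stated in the text


def launch_dims(n, n_bandwidths, block_size=BLOCK_SIZE):
    """Return (grid_x, grid_y) for n samples and n_bandwidths test bandwidths.

    Rounding up is an assumption made here so that every sample gets a thread.
    """
    grid_x = (n + block_size - 1) // block_size
    return grid_x, n_bandwidths


def thread_task(block_x, block_y, thread_x, n, block_size=BLOCK_SIZE):
    """Map CUDA-style indices to (sample index z, bandwidth index).

    Returns None for surplus threads in the last block that have no sample.
    """
    z = block_x * block_size + thread_x
    if z >= n:
        return None
    return z, block_y


if __name__ == "__main__":
    print(launch_dims(1000, 5))
    print(thread_task(15, 2, 39, 1000))
    print(thread_task(15, 2, 40, 1000))
```

With this mapping, every (sample, bandwidth) pair is handled by exactly one thread, which matches the idea of one atomic process per regression point and test bandwidth.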

4. Results and Discussion

4.1. Data Source

To explore the real performance of FPGWR, three data sources are used for the experiment: the simulation dataset, the “Zillow test dataset” [26], and the “Georgia” dataset [33]. The simulation dataset is designed to evaluate the influence of the sample size and the number of independent variables. The “Zillow test dataset” (https://github.com/Ziqi-Li/FastGWR) is used to compare the acceleration performance of the different GWR packages. The “Georgia” dataset is used to validate the accuracy of the FPGWR results against other schemes.

4.1.1. Simulation Dataset

The test region is a square area [34] with sides of length $l$, over which the sample points are distributed evenly. With $c$ samples in each row, the total number of samples is $n = c \times c$. The distance between two adjacent samples is $\Delta l = l / (c - 1)$. The lower-left corner is defined as the origin of the coordinate system. The positions of the samples are given by

$$(u_i, v_i) = \left( \Delta l \times \mathrm{mod}(i - 1,\, c),\ \Delta l \times \mathrm{floor}\left(\frac{i - 1}{c}\right) \right) \tag{23}$$

where $\mathrm{mod}(i - 1, c)$ is the remainder of $i - 1$ divided by $c$, and floor denotes rounding down to the nearest integer.
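The sample layout of Equation (23) can be reproduced with a short Python function, reading mod as the remainder of $i - 1$ divided by $c$ and floor as integer division. The function name `grid_coords` is hypothetical.

```python
def grid_coords(c, l):
    """Coordinates of the c * c evenly spaced samples, i = 1..n, following Equation (23)."""
    dl = l / (c - 1)  # spacing between adjacent samples
    return [(dl * ((i - 1) % c), dl * ((i - 1) // c))
            for i in range(1, c * c + 1)]


if __name__ == "__main__":
    for point in grid_coords(3, 2.0):
        print(point)
```

The samples are numbered row by row starting from the lower-left corner, so the first $c$ samples share $v = 0$ and the last sample sits at $(l, l)$.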

The sample data are generated by the GWR model predefined in Equation (24):

$$y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, x_{i1} + \beta_2(u_i, v_i)\, x_{i2} + \beta_3(u_i, v_i)\, x_{i3} + \beta_4(u_i, v_i)\, x_{i4} + \varepsilon_i \tag{24}$$

To unify the dimensions of the regression coefficients $\beta$, all of the values are limited to the interval $(0, \beta_{max})$, where $\beta_{max}$ is a fixed constant. The coefficients $\beta$ follow the five functions below:

$$\beta_0(u_i, v_i) = \frac{2 \beta_{max}}{l^{2}} \left( \frac{l^{2}}{2} - (l - u_i)^{2} - (l - v_i)^{2} \right) \tag{25}$$

$$\beta_1(u_i, v_i) = \frac{\beta_{max}}{2} \left( \left(\sin \frac{u_i \pi}{l}\right)^{2} + \left(\sin \frac{v_i \pi}{l}\right)^{2} \right) \tag{26}$$

$$\beta_2(u_i, v_i) = \frac{\beta_{max}}{2} \left( 2 - \left( \left(\tan\left(\frac{u_i \pi}{2 l} - \frac{\pi}{4}\right)\right)^{2} + \left(\tan\left(\frac{v_i \pi}{2 l} - \frac{\pi}{4}\right)\right)^{2} \right) \right) \tag{27}$$

$$\beta_3(u_i, v_i) = \beta_{max}\, e^{-\frac{1}{2 l} \left( \left(\frac{l}{2} - u_i\right)^{2} + \left(\frac{l}{2} - v_i\right)^{2} \right)} \tag{28}$$

$$\beta_4(u_i, v_i) = \frac{16 \beta_{max}}{l^{4}} \left( \frac{l^{2}}{4} - \left(\frac{l}{2} - u_i\right)^{2} \right) \left( \frac{l^{2}}{4} - \left(\frac{l}{2} - v_i\right)^{2} \right) \tag{29}$$

The spatial distribution of the coefficients $\beta$ is displayed in Figure 3. The five coefficients $\beta$ selected by the model are closely related to the position of the sample, which demonstrates the spatial non-stationarity of the observations.
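To give a feel for how such coefficient surfaces vary over the square, two of them, Equations (26) and (29), are sketched below in Python. The function names and the values of `l` and `beta_max` in the demo are illustrative choices, not parameters taken from the paper.

```python
import math


def beta1(u, v, l, beta_max):
    """Coefficient surface of Equation (26): peaks along the centre lines of the square."""
    return beta_max / 2.0 * (math.sin(u * math.pi / l) ** 2
                             + math.sin(v * math.pi / l) ** 2)


def beta4(u, v, l, beta_max):
    """Coefficient surface of Equation (29): a dome that vanishes on the boundary."""
    q = l * l / 4.0
    return (16.0 * beta_max / l ** 4
            * (q - (l / 2.0 - u) ** 2)
            * (q - (l / 2.0 - v) ** 2))


if __name__ == "__main__":
    l, beta_max = 10.0, 5.0  # illustrative values only
    for u, v in [(0.0, 0.0), (5.0, 5.0), (10.0, 2.5)]:
        print((u, v), beta1(u, v, l, beta_max), beta4(u, v, l, beta_max))
```

Both surfaces reach $\beta_{max}$ at the centre of the square and fall to zero at the origin, so a coefficient estimated at one location differs from the one at another location, which is the non-stationarity the simulation is meant to exhibit.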

According to Equation (24), eight sample datasets are produced for testing. The construction parameters and resource link of the datasets are given in Table 1.

4.1.2. Zillow Test Dataset

The “Zillow test dataset” [26] is a subset of the Zillow property dataset, which consists of single-family housing information within the metropolitan area of Los Angeles. The model for the dataset is given by Equation (30). The dataset is available on GitHub along with the FastGWR algorithm (https://github.com/Ziqi-Li/FastGWR). We downloaded eight datasets (1 k, 2 k, 5 k, 10 k, 15 k, 20 k, 50 k and 100 k) from the GitHub repository for the comparative experiment.

$$\mathrm{Value}_i = \beta_{i0} + \beta_{i1}\, \mathrm{Area}_i + \beta_{i2}\, \mathrm{Nbaths}_i + \beta_{i3}\, \mathrm{Nbeds}_i + \beta_{i4}\, \mathrm{Age}_i + \varepsilon_i \tag{30}$$

4.1.3. Georgia Dataset

The Georgia dataset [33] contains a subset (socio-demographic characteristics) of the 1990 US census within the state of Georgia. The coordinates of the data points are set at the centroids of counties, so the dataset contains 159 records of county population attributes. The model is defined as Equation (31):

$$\mathrm{PctBach} = \beta_0 + \beta_1\, \mathrm{Intercept} + \beta_2\, \mathrm{PctPov} + \beta_3\, \mathrm{PctRural} + \beta_4\, \mathrm{PctBlack} + \varepsilon \tag{31}$$

4.2. Testing Specifications and Environment

The FPGWR experiments are conducted on a desktop computer with an Intel i7-9700K 3.60 GHz 8-core CPU (Intel Corporation, Santa Clara, CA, USA), 16 GB of Random Access Memory (RAM) (Kingston Technology Corporation, Fountain Valley, CA, USA) and an NVIDIA GeForce RTX 2080 Ti 11 GB GPU (NVIDIA Corporation, San Tomas Expressway, Santa Clara, CA, USA). The software environment comprises version 10.2.95 of the CUDA development kit, version 14.0.25431.01 Update 3 of Microsoft Visual Studio 2015 (development IDE), and the Microsoft Windows 10.0.17134 Professional Edition operating system (OS). The GPU device contains 4352 SP core units. Its high-speed task scheduling allows it to sustain a heavy load of multithreaded computation tasks in parallel.

4.3. Results

4.3.1. FPGWR Performance

Owing to parallelization, the regression efficiency of FPGWR increases dramatically relative to GWR. Bandwidth optimization is a repetitive process, so to compare acceleration performance this study analyzes only a single subprocess with a fixed bandwidth. The runtimes of FPGWR with different sample sizes are shown in Table 2. Given four independent variables, the runtime stays within 2 s for sample sizes up to 40 k. After increasing the sample size to 250 k, the runtime becomes approximately 66.6 s. At the million scale, it takes approximately 1094.7 s. The runtime therefore grows faster than linearly with the sample size: the 4-fold increase from 250 k to 1000 k points raises the runtime roughly 16-fold.

The runtimes vary tremendously with sample size, so Figure 4 uses a logarithmic y-axis to display all results together. Taken together, Table 2 and Figure 4 show that both the sample size and the number of independent variables influence the runtime. The regression time is positively associated with the number of independent variables. As the sample size varies, the runtime follows a similar pattern for each number of independent variables. For a given sample size, the runtimes for different numbers of independent variables differ by a roughly constant factor. In summary, the sample size has a more pronounced impact on the regression time than the number of independent variables.
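One way to quantify how runtime scales with sample size is to compute the slope of log(runtime) against log(n) between consecutive rows of Table 2. The sketch below does this for the four-variable column; the function name is illustrative, and the estimate is only as reliable as the single timings it is based on. A slope near 1 would indicate linear growth and a slope near 2 quadratic growth.

```python
import math

# Runtimes in seconds for four independent variables, copied from Table 2.
TABLE2_FOUR_VARS = [
    (100, 0.003),
    (1600, 0.022),
    (6400, 0.095),
    (10_000, 0.186),
    (40_000, 1.867),
    (250_000, 66.616),
    (1_000_000, 1094.654),
]


def local_exponents(rows):
    """Slope of log(runtime) against log(n) between consecutive rows."""
    out = []
    for (n0, t0), (n1, t1) in zip(rows, rows[1:]):
        out.append((n0, n1, math.log(t1 / t0) / math.log(n1 / n0)))
    return out


exponents = local_exponents(TABLE2_FOUR_VARS)
for n0, n1, k in exponents:
    print(f"{n0:>9,} -> {n1:>9,}: runtime ~ n^{k:.2f}")
```

For the two largest sample sizes the slope comes out close to 2, while the smallest sizes give lower values, where fixed overheads and timer resolution plausibly dominate.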

Speed-up and efficiency are important metrics for evaluating the performance of parallel algorithms [35]. Speed-up is the ratio of single-processor runtime to multiprocessor runtime, and efficiency is the speed-up averaged over the processors used [36]. The speed-up of FPGWR depends on GPU performance, which is determined by the number of SPs, the base clock frequency, and the memory bandwidth. Table 3 compares the configuration and performance of different GPU types. Using the 250,000-point simulation dataset, this study regresses the model given in Equation (24) with an increasing number of GPU cores. The runtime of the GTX 1050 is set as the benchmark value for calculating the speed-up factor of GPUs with different SP numbers. The growth in speed-up indicates that FPGWR has outstanding parallel scalability. Figure 5 illustrates that the computation time decreases approximately linearly as the number of GPU cores increases, and it exhibits a clear positive linear relationship between efficiency and the number of GPU cores.
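The speed-up factors in Table 3 follow directly from the runtimes, as the following sketch shows. The data are copied from Table 3. The "relative efficiency" column is an assumption introduced here for illustration (speed-up divided by the core count relative to the GTX 1050); it ignores differences in clock frequency, memory bandwidth and GPU architecture, so it should be read only as a rough indicator.

```python
# (GPU, SP cores, runtime in ms) copied from Table 3.
TABLE3 = [
    ("GTX 1050", 640, 568_086),
    ("GTX 1060", 1280, 391_968),
    ("GTX 1070", 1920, 232_801),
    ("RTX 2070", 2304, 129_665),
    ("RTX 2080", 2944, 92_141),
    ("RTX 2080 Ti", 4352, 65_916),
]


def speedup_table(rows):
    """Speed-up against the first row, plus speed-up per relative core count."""
    _, base_cores, base_ms = rows[0]
    result = []
    for name, cores, ms in rows:
        speedup = base_ms / ms
        # Assumption: efficiency = speed-up / (cores relative to the baseline GPU).
        efficiency = speedup / (cores / base_cores)
        result.append((name, round(speedup, 2), round(efficiency, 2)))
    return result


for name, s, e in speedup_table(TABLE3):
    print(f"{name:<12} speed-up {s:>5.2f}  relative efficiency {e:>4.2f}")
```

The computed speed-ups reproduce the last column of Table 3.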

The experimental results demonstrate the outstanding capability of FPGWR to accelerate GWR, although its performance varies slightly among different orders of magnitude of observations. With appropriate sample sizes and numbers of independent variables, the full potential of FPGWR can be realized in various fields.

4.3.2. Comparison of FPGWR and Other GWR

Thanks to advances in computer hardware, researchers can quickly and easily build a GPU environment for large-scale spatial studies. To verify the acceleration capability, four other GWR packages, namely FastGWR (Python), MGWR (Python), GWmodel (R), and spgwr (R), are selected for comparison with FPGWR. The test data used in the experiment is the “Zillow test dataset.” GWmodel uses a moving-window weighting technique to reduce computation. FastGWR implements distributed parallelism in an HPC environment to improve operating efficiency, and it is superior to MGWR, GWmodel and spgwr in overall computational efficiency. Although the optimal environment for FastGWR is an HPC cluster, it is fairer to conduct all experiments in a single desktop environment.

The runtimes of the five packages with different sample sizes are displayed in Table 4. Each runtime covers only a single regression with a specified bandwidth (as in Section 4.3.1). Given 1000 observations, FPGWR is 5 times faster than FastGWR, 32 times faster than MGWR, 88 times faster than GWmodel, and 865 times faster than spgwr. As the sample size increases to 10,000, FPGWR is approximately 14 times faster than FastGWR, approximately 157 times faster than MGWR, approximately 2185 times faster than GWmodel, and approximately 45,811 times faster than spgwr. As the sample size grows further, spgwr is the first to fail to complete the regression task (at 15,000 observations in Table 4), followed by GWmodel and MGWR (at 20,000 observations). The cause is that these three schemes do not avoid storing the high-dimensional weight matrix and other intermediate matrices.
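The ratios quoted above, and the largest sample size each package completed, can be checked against Table 4 with a few lines of code. This is a sketch with illustrative names; the "n/a" entries of Table 4 are represented as `None` and interpreted as runs for which no runtime was reported.

```python
NA = None  # "n/a" in Table 4: no runtime reported

PACKAGES = ("FPGWR", "FastGWR", "MGWR", "GWmodel", "spgwr")

# Runtimes in seconds copied from Table 4, one tuple per sample size.
TABLE4 = {
    1_000: (0.01, 0.05, 0.32, 0.88, 8.65),
    2_000: (0.02, 0.13, 0.95, 4.12, 60.26),
    5_000: (0.06, 0.47, 5.43, 53.15, 1095.80),
    10_000: (0.18, 2.45, 28.34, 393.30, 8245.93),
    15_000: (0.27, 4.42, 58.97, 1464.12, NA),
    20_000: (0.50, 6.76, NA, NA, NA),
    50_000: (2.72, 64.57, NA, NA, NA),
    100_000: (10.80, 307.12, NA, NA, NA),
}


def ratios_to_fpgwr(n):
    """Runtime of each other package divided by the FPGWR runtime at size n."""
    row = dict(zip(PACKAGES, TABLE4[n]))
    base = row["FPGWR"]
    return {name: round(t / base) for name, t in row.items()
            if name != "FPGWR" and t is not NA}


def largest_completed(package):
    """Largest sample size for which Table 4 reports a runtime."""
    col = PACKAGES.index(package)
    return max(n for n, row in TABLE4.items() if row[col] is not NA)


print(ratios_to_fpgwr(1_000))
print(ratios_to_fpgwr(10_000))
print({name: largest_completed(name) for name in PACKAGES})
```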

The runtimes of the five packages are illustrated in Figure 6. The y-axis is marked on a logarithmic scale to display all of the results together. Observing each package separately reveals that the runtimes of all five schemes increase steeply, and faster than linearly, with the sample size. According to Figure 6, the performance of the five packages improves generation by generation, and FPGWR is the fastest implementation among these schemes.

Overall, FPGWR is a feasible GWR accelerator with a low development cost and a simple productization process. Compared with other packages, FPGWR can greatly simplify a complicated job by decomposing the redundant full-sample regression.

4.3.3. Validation of the Result Accuracy

To validate the accuracy of FPGWR against other GWR packages, the results ($\hat{β}$, $Adj. R^{2}$ and $AIC_{c}$ scores) of the five packages are compared based on the well-known “Georgia” dataset. The dependent variable PctBach and independent variables Intercept, PctPov, PctRural and PctBlack are chosen to calibrate the same GWR model according to Equation (31). Using the adaptive Bi-square kernel function, 93 nearest neighbors are selected as the optimal bandwidth. As shown in Table 5, the Mean and Standard Deviation of the estimated coefficients $\hat{β}$ are displayed in the middle section, and the $Adj. R^{2}$ and $AIC_{c}$ scores are indicated in the lower section. The FPGWR results match those of the other four packages to the reported precision.
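For readers unfamiliar with adaptive kernels, the following sketch shows one common form of the adaptive bi-square weighting: the bandwidth at each regression point is the distance to its $k$-th nearest neighbor, and the weight is $(1 - (d/b)^{2})^{2}$ inside the bandwidth and 0 outside. Conventions differ between packages (for example, whether the regression point itself counts as a neighbor), so the choices below are assumptions and the function name is illustrative.

```python
import math


def adaptive_bisquare_weights(points, i, k):
    """Weights of all points for regression point i, using its k nearest neighbours.

    The bandwidth is the distance from point i to its k-th nearest neighbour,
    with point i itself counted as the first. Points at or beyond the
    bandwidth receive weight 0.
    """
    ui, vi = points[i]
    dists = [math.hypot(u - ui, v - vi) for u, v in points]
    bandwidth = sorted(dists)[k - 1]
    if bandwidth == 0.0:
        raise ValueError("k-th neighbour coincides with the regression point")
    return [(1.0 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
            for d in dists]


# Five collinear points; with k = 3 the bandwidth at point 0 is 2.0.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
weights = adaptive_bisquare_weights(line, 0, 3)
```

For the Georgia experiment described above, `k` would be 93 and `points` the 159 county centroids.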

The spatial distribution of the estimated coefficients $\hat{β}$ in the study area is illustrated in Figure 7. To better represent the spatial variation, the coefficient surfaces are interpolated as continuous surfaces using the griddata method.

4.4. Discussion

Before FPGWR atomizes the large process, the GWR model requires multiple loops, and enormous calculation redundancy inevitably emerges in its implementation. The algorithm is structured as nested multilevel loops, where the upper loop depends on the lower loop and the internal sequence of each loop is fixed. The original iteration order of the subprocesses cannot be disturbed; otherwise, the accuracy of the regression results would be seriously questioned. FPGWR introduces a hybrid (parallel–serial) mode, which enables the GPU device not only to handle the parallel tasks of each batch but also to complete all of the tasks efficiently. The subprocesses can then be executed in any order without errors, and the accuracy of the results is guaranteed for the model diagnosis. FPGWR differs markedly from GWR in its memory usage and time cost.

4.4.1. Memory

The matrix storage strategy of GWR differs from that of FPGWR, as shown in Figure 8. FPGWR optimizes storage through two schemes: on-demand storage and matrix vectorization. In the GWR model, the weight matrix $W_{i}$ is stored as an $n \times n$ diagonal matrix. Although only the diagonal elements must be computed, storage of size $n^{2}$ is required, so the memory complexity is $O(n^{2})$. In the subsequent steps, the calculations of the matrices $B$, $B^{-1}$ and $A$ all inherit this memory complexity. In comparison, the FPGWR method stores only the data it requires, which reduces its memory complexity to $O((p + 1) n)$ (commonly $p + 1 \leq 10$).

Table 6 compares the memory usage of FPGWR and GWR. When the sample size is less than 100,000, the memory of the GPU device is still sufficient. Once the size is increased to 10,000,000, GWR requires approximately 364 TB of RAM, which no existing single GPU device could allocate. In contrast, FPGWR consumes only 380 MB of RAM. To summarize, FPGWR demonstrates a tremendous memory advantage over GWR.
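The orders of magnitude in Table 6 can be approximated from the two complexity terms. The sketch below assumes one 32-bit float per stored value, $p = 9$, and binary units (1 KB = 1024 bytes); these accounting choices are assumptions, and the function names are illustrative. Under them, the estimate gives about 3.9 KB against 39 KB at 100 points and about 364 TB for the dense matrix at 10,000,000 points.

```python
BYTES_PER_FLOAT32 = 4


def classical_gwr_bytes(n):
    """Dense n x n weight matrix, one float32 per entry."""
    return n * n * BYTES_PER_FLOAT32


def fpgwr_bytes(n, p=9):
    """(p + 1) * n float32 values, the O((p + 1) n) term from the text."""
    return (p + 1) * n * BYTES_PER_FLOAT32


def human(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1024.0 or unit == "TB":
            return f"{value:.1f} {unit}"
        value /= 1024.0


for n in (100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,}  FPGWR {human(fpgwr_bytes(n)):>10}"
          f"  classical {human(classical_gwr_bytes(n)):>10}")
```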

4.4.2. Time

Calculating the hat matrix $S$ is a necessary step in GWR. The weight matrix $W_{i}$ is a special diagonal matrix that does not increase the runtime complexity during matrix multiplication. The runtime complexity of matrix $A$ is $O(1)$, and the runtime complexity of $B$ is $O((p + 1)^{2} n)$. Because $p + 1$ is far less than $n$, the computing time of matrix $B$ can be ignored. Given a fixed weight bandwidth, the runtime complexity of hat matrix $S_{i}$ can be expressed as $O((p + 1)^{2} n)$. After $n$ operations on $S_{i}$, the runtime complexity of matrix $S$ is $O((p + 1)^{2} n^{2})$. Different GWR schemes select the optimal bandwidth in different ways, so this study omits a discussion of the runtime complexity of the whole process; the regression subprocess of a single sample is the appropriate unit for the runtime analysis in this subsection.

Instead of computing iteratively, FPGWR atomizes the regression process of each point as an independent thread. This strategy appreciably reduces the runtime complexity of matrix $S$. At the same time, the runtime complexity of matrix $A$ becomes $O((p + 1) n)$, and the runtime complexity of matrix $B$ becomes $O((p + 1)^{2} n)$. By combining matrices $A$ and $B$ in the form of parallel addition, the hat matrix $S_{i}$ gains a runtime complexity of $O((p + 1)(p + 2) n)$. The runtime complexity of FPGWR is theoretically lower than that of GWR, but the instruction efficiency of the host device differs significantly from that of the GPU device, and the methods used by different libraries to optimize matrix operations are inconsistent. Therefore, the actual comparison should refer to the experimental results (in Section 4.3.2).

Obtaining the GWR regression coefficients requires many time-consuming iterations when handling big geodata. For example, when the sample size is 1,000,000, the single-point regression of GWR must be iterated 1,000,000 times, at a large cost in time and memory. A parallelization strategy can solve this problem. The atomization algorithm does not store the weight matrix and other temporary matrices during each point regression iteration, but only reads and calculates the matrix elements on demand. By parallelizing these atomic units on CUDA, FPGWR shortens computation time while using much less memory.
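The idea of computing weights on demand, rather than storing an $n \times n$ weight matrix, can be illustrated on the CPU. The following sketch is not the authors' CUDA kernel: it accumulates the small matrices $X^{T} W_{i} X$ and $X^{T} W_{i} y$ in a single pass over the data for one regression point and then solves the resulting $(p + 1) \times (p + 1)$ system. The fixed Gaussian kernel, the function names and the toy data are all assumptions chosen for brevity.

```python
import math


def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - s) / aug[r][r]
    return x


def regress_point(i, coords, X, y, bandwidth):
    """Local coefficients at point i; weights are computed on the fly, never stored."""
    k = len(X[0])
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    ui, vi = coords[i]
    for (u, v), row, target in zip(coords, X, y):
        d = math.hypot(u - ui, v - vi)
        w = math.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian kernel (assumed)
        for a in range(k):
            xtwy[a] += w * row[a] * target
            for b in range(k):
                xtwx[a][b] += w * row[a] * row[b]
    return solve(xtwx, xtwy)


# Toy data with an exact linear relationship y = 2 + 3x at six collinear points.
coords = [(float(j), 0.0) for j in range(6)]
X = [[1.0, float(j)] for j in range(6)]
y = [2.0 + 3.0 * row[1] for row in X]
betas = [regress_point(i, coords, X, y, bandwidth=2.0) for i in range(len(coords))]
```

Because each call to `regress_point` reads only shared inputs and writes only its own result, the calls are independent of one another, which is the property that allows one regression point to be mapped to one GPU thread.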

5. Conclusions

GWR is a local modeling technique that has been widely used in various disciplines. However, GWR has significant computational redundancy and can handle approximately 15,000 geographical observations at most. To apply this local smoothing technique to large-scale spatial datasets, we proposed an improved algorithm, FPGWR. FPGWR optimizes the matrix storage mode to overcome the limitation on memory space, thereby significantly reducing the memory complexity of GWR. Furthermore, it introduces a parallel computing mode that decomposes the full-sample loop into an atomized process, which decreases the runtime complexity substantially.

To demonstrate the practicability of FPGWR, simulation and Zillow datasets are used to conduct the experiments. The results show that the regression runtime of the existing GWR packages rises so steeply with the number of observations that they are unable to process regression tasks on large volumes of geodata. In comparison, the runtime of FPGWR grows far more slowly over the tested range; hence, FPGWR represents a significant advance in handling massive geodata mining tasks.

In summary, FPGWR considerably alleviates the data-scale limitation of GWR, which could substantially expand the application domains of GWR. Under these circumstances, increasingly large datasets from geographical or nongeographical fields could feed large-scale geographic analysis services. In the future, we will investigate a key issue: how to adapt FPGWR to non-CUDA architectures, and even to non-GPU HPC devices, to enhance the versatility of the algorithm.

Author Contributions

Conceptualization, Dongchao Wang, and Yi Yang; Methodology, Dongchao Wang; Resources, Dongchao Wang; Software, Dongchao Wang; Validation, Dongchao Wang; Writing—original draft, Dongchao Wang; Writing—review and editing, Dongchao Wang, Yi Yang, Agen Qiu, Xiaochen Kang, Jiakuan Han, and Zhengyuan Chai. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Project (grant number 2019YFB2102500/2019YFB2102503), the National Natural Science Foundation of China (grant number 71903183, 41801316, 41701461), and the Basic Scientific Research Fund of CASM (grant number AR1910).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Toch, E.; Lerner, B.; Ben-Zion, E.; Ben-Gal, I. Analyzing large-scale human mobility data: A survey of machine learning methods and applications. Knowl. Inf. Syst. 2019, 58, 501–523.
2. Weckström, C.; Kujala, R.; Mladenović, M.N.; Saramäki, J. Assessment of large-scale transitions in public transport networks using open timetable data: Case of Helsinki metro extension. J. Transp. Geogr. 2019, 79, 102470.
3. Hicks, J.L.; Althoff, T.; Sosic, R.; Kuhar, P.; Bostjancic, B.; King, A.C.; Leskovec, J.; Delp, S.L. Best practices for analyzing large-scale health data from wearables and smartphone apps. NPJ Digit. Med. 2019, 2, 1–12.
4. Tasar, O.; Tarabalka, Y.; Alliez, P. Incremental learning for semantic segmentation of large-scale remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3524–3537.
5. Li, Z.; Huang, Q.; Jiang, Y.; Hu, F. SOVAS: A scalable online visual analytic system for big climate data analysis. Int. J. Geogr. Inf. Sci. 2020, 34, 1188–1209.
6. Miller, H.J.; Goodchild, M.F. Data-driven geography. GeoJournal 2015, 80, 449–461.
7. Xia, J.; Huang, S.; Zhang, S.; Li, X.; Lyu, J.; Xiu, W.; Tu, W. DAPR-tree: A distributed spatial data indexing scheme with data access patterns to support Digital Earth initiatives. Int. J. Digit. Earth 2020, 1–16.
8. Aji, A.; Wang, F.; Vo, H.; Lee, R.; Liu, Q.; Zhang, X.; Saltz, J. Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. In Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, Trento, Italy, 26–30 August 2013; Volume 6.
9. Wu, Q.Y.; Su, K.Y.; Zou, Z.J. A mapreduce-based method for parallel calculation of bus passengers origin and destination from massive transit data. J. Geo Inf. Sci. 2018, 20, 647–655.
10. Wilkinson, B.; Allen, M. Parallel Programming; Prentice Hall: Upper Saddle River, NJ, USA, 1999.
11. Gong, Z.; Tang, W.; Bennett, D.A.; Thill, J.-C.F. Parallel agent-based simulation of individual-level spatial interactions within a multicore computing environment. Int. J. Geogr. Inf. Sci. 2013, 27, 1152–1170.
12. Tang, W.; Feng, W.; Jia, M. Massively parallel spatial point pattern analysis: Ripley’s K function accelerated using graphics processing units. Int. J. Geogr. Inf. Sci. 2015, 29, 412–439.
13. Zhang, G.; Zhu, A.X.; Huang, Q. A GPU-accelerated adaptive kernel density estimation approach for efficient point pattern analysis on spatial big data. Int. J. Geogr. Inf. Sci. 2017, 31, 2068–2097.
14. Sandric, I.; Ionita, C.; Chitu, Z.; Dardala, M.; Irimia, R.; Furtuna, F.T. Using CUDA to accelerate uncertainty propagation modelling for landslide susceptibility assessment. Environ. Model. Softw. 2019, 115, 176–186.
15. Stojanovic, N.; Stojanovic, D. Parallelizing multiple flow accumulation algorithm using cuda and openacc. ISPRS Int. J. Geo Inf. 2019, 8, 386.
16. Pei, T.; Song, C.; Guo, S.; Shu, H.; Liu, Y.; Du, Y.; Ma, T.; Zhou, C. Big geodata mining: Objective, connotations and research issues. J. Geogr. Sci. 2020, 30, 251–266.
17. Brunsdon, C.; Fotheringham, A.S.; Charlton, M.E. Geographically weighted regression: A method for exploring spatial nonstationarity. Geogr. Anal. 1996, 28, 281–298.
18. Zhang, P.; Yang, D.; Zhang, Y.; Li, Y.; Liu, Y.; Cen, Y.; Zhang, W.; Geng, W.; Rong, T.; Liu, Y.; et al. Re-examining the drive forces of China’s industrial wastewater pollution based on GWR model at provincial level. J. Clean. Prod. 2020, 262, 121309.
19. Wu, D. Spatially and Temporally Varying Relationships between Ecological Footprint and Influencing Factors in China’s Provinces Using Geographically Weighted Regression (GWR). J. Clean. Prod. 2020, 261, 121089.
20. Yuan, Y.; Cave, M.; Xu, H.; Zhang, C. Exploration of spatially varying relationships between Pb and Al in urban soils of London at the regional scale using geographically weighted regression (GWR). J. Hazard. Mater. 2020, 393, 122377.
21. Hong, I.; Yoo, C. Analyzing Spatial Variance of Airbnb Pricing Determinants Using Multiscale GWR Approach. Sustainability 2020, 12, 4710.
22. Wu, S.; Wang, Z.; Du, Z.; Huang, B.; Zhang, F.; Liu, R. Geographically and temporally neural network weighted regression for modeling spatiotemporal non-stationary relationships. Int. J. Geogr. Inf. Sci. 2020, 1–27.
23. Bivand, R.; Yu, D.; Nakaya, T.; Garcia-Lopez, M.A. Package SPGWR; R Software Package; R Foundation for Statistical Computing: Vienna, Austria, 2020.
24. Gollini, I.; Lu, B.; Charlton, M. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. J. Stat. Softw. 2015, 63, 1–50.
25. Oshan, T.M.; Li, Z.; Kang, W.; Wolf, L.J.; Fotheringham, A.S. mgwr: A Python implementation of multiscale geographically weighted regression for investigating process spatial heterogeneity and scale. ISPRS Int. J. Geo Inf. 2019, 8, 269.
26. Li, Z.; Fotheringham, A.S.; Li, W.; Oshan, T. Fast Geographically Weighted Regression (FastGWR): A scalable algorithm to investigate spatial process heterogeneity in millions of observations. Int. J. Geogr. Inf. Sci. 2019, 33, 155–175.
27. Tran, H.T.; Nguyen, H.T.; Tran, V.T. Large-scale geographically weighted regression on Spark. In Proceedings of the 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE), Hanoi, Vietnam, 6–8 October 2016; pp. 127–132.
28. Foster, S.A.; Gorr, W.L. An adaptive filter for estimating spatially-varying parameters: Application to modeling police hours spent in response to calls for service. Manag. Sci. 1986, 32, 878–889.
29. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
30. Brunsdon, C.; Fotheringham, A.S.; Charlton, M. Geographically weighted summary statistics—A framework for localised exploratory data analysis. Comput. Environ. Urban Syst. 2002, 26, 501–524.
31. Harris, R.; Singleton, A.; Grose, D.; Brunsdon, C.; Longley, P. Grid-enabling geographically weighted regression: A case study of participation in higher education in England. Trans. GIS 2010, 14, 43–61.
32. NVIDIA Corporation. Compute Unified Device Architecture (CUDA). Available online: https://developer.nvidia.com/cuda-toolkit (accessed on 6 October 2020).
33. Fotheringham, A.S.; Brunsdon, C.; Charlton, M. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships; John Wiley & Sons: Hoboken, NJ, USA, 2002.
34. Zhang, H.; Mei, C. Local least absolute deviation estimation of spatially varying coefficient models: Robust geographically weighted regression approaches. Int. J. Geogr. Inf. Sci. 2011, 25, 1467–1489.
35. Eager, D.L.; Zahorjan, J.; Lazowska, E.D. Speedup versus efficiency in parallel systems. IEEE Trans. Comput. 1989, 38, 408–423.
36. Yang, L.; Sun, X.; Li, Z. An efficient framework for remote sensing parallel processing: Integrating the artificial bee colony algorithm and multiagent technology. Remote Sens. 2019, 11, 152.

Figure 1. Optimizing the compute unified device architecture (CUDA) kernel function.

Figure 2. Flow diagram of FPGWR on CUDA (a–c).

Figure 3. Coefficient ($β_{0}$, $β_{1}$, $β_{2}$, $β_{3}$ and $β_{4}$) surface.

Figure 4. Runtime comparison (bar) for different numbers of coefficients using different numbers of data points.

Figure 5. Performance comparison of FPGWR for an increasing number of GPU cores.

Figure 6. Runtime comparison (bar) for different packages using different numbers of data points.

Figure 7. Surfaces of coefficient estimates.

Figure 8. Different storage modes of weight matrix between classical GWR and FPGWR.

Table 1. Construction parameters of example data and access entrance of resource.

$l$	$c$	$β_{max}$	$x_{max}$	$σ$	Number of Data Points
10	10	4	2	0.5	100
10	40	4	2	0.5	1600
10	80	4	2	0.5	6400
10	100	4	2	0.5	10,000
10	200	4	2	0.5	40,000
10	500	4	2	0.5	250,000
10	1000	4	2	0.5	1,000,000
10	2000	4	2	0.5	4,000,000

Note: Resource URL: https://pan.baidu.com/s/1c0Ga8Ngej0SG990sdxHQ_A. Access code: 8dve.

Table 2. Runtime (in seconds) for different numbers of coefficients using different numbers of data points.

Number of Data Points	Four Independent Variables	Three Independent Variables	Two Independent Variables
100	0.003	0.002	0.001
1600	0.022	0.017	0.011
6400	0.095	0.068	0.045
10,000	0.186	0.126	0.063
40,000	1.867	1.256	0.766
250,000	66.616	46.386	26.154
1,000,000	1094.654	738.636	421.922

Table 3. Performance comparison for different types of graphics processing unit (GPU).

Type of NVIDIA GPU	SP Number	Base Clock Frequency (MHz)	Memory Bandwidth (GB/s)	Runtime (ms)	Speed-Up Factor
GTX 1050	640	1354	84	568,086	1
GTX 1060	1280	1506	192	391,968	1.45
GTX 1070	1920	1506	256	232,801	2.44
RTX 2070	2304	1410	448	129,665	4.38
RTX 2080	2944	1515	448	92,141	6.17
RTX 2080 Ti	4352	1350	616	65,916	8.62

Table 4. Runtime (in seconds) for five GWR packages using different numbers of data points.

Number of Data Points	FPGWR	FastGWR	MGWR	GWmodel	Spgwr
1000	0.01	0.05	0.32	0.88	8.65
2000	0.02	0.13	0.95	4.12	60.26
5000	0.06	0.47	5.43	53.15	1095.80
10,000	0.18	2.45	28.34	393.30	8245.93
15,000	0.27	4.42	58.97	1464.12	n/a
20,000	0.50	6.76	n/a	n/a	n/a
50,000	2.72	64.57	n/a	n/a	n/a
100,000	10.80	307.12	n/a	n/a	n/a

Table 5. Statistical results of local coefficient estimates and regression estimates for five GWR packages.

Variables	Statistic	FPGWR	FastGWR	MGWR	GWmodel	Spgwr
Intercept	Mean	23.0748	23.0748	23.0748	23.0748	23.0748
Intercept	Standard Deviation	4.1048	4.1048	4.1048	4.1048	4.1048
PctPov	Mean	−0.2625	−0.2625	−0.2625	−0.2625	−0.2625
PctPov	Standard Deviation	0.0916	0.0916	0.0916	0.0916	0.0916
PctRural	Mean	−0.1181	−0.1181	−0.1181	−0.1181	−0.1181
PctRural	Standard Deviation	0.0370	0.0370	0.0370	0.0370	0.0370
PctBlack	Mean	0.0445	0.0445	0.0445	0.0445	0.0445
PctBlack	Standard Deviation	0.0576	0.0576	0.0576	0.0576	0.0576
PctBach	Mean	10.9363	10.9363	10.9363	10.9363	10.9363
PctBach	Standard Deviation	4.3489	4.3489	4.3489	4.3489	4.3489
$Adj. R^{2}$	Value	0.5812	0.5812	0.5812	0.5812	0.5812
$AIC_{c}$	Value	896.35	896.35	896.35	896.35	896.35

Table 6. The comparison of memory usage for FPGWR against classical GWR.

Number of Data Points	FPGWR	Classical GWR
100	3.9 KB	39 KB
1000	39 KB	3.8 MB
10,000	390 KB	380 MB
100,000	3.8 MB	38 GB
1,000,000	38 MB	3.8 TB
10,000,000	380 MB	364 TB

Note: 32-bit floats were used for all decimals and $p = 9$.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
