Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning

Chen, Yuxuan; Ou, Feifei; Liu, Qiliang; Wu, Gusheng; Chen, Kaiqi; Deng, Min; Chen, Meihua; Xu, Rui

doi:10.3390/app142210704

Open AccessArticle

Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning

by

Yuxuan Chen

¹

,

Feifei Ou

¹,

Qiliang Liu

¹

,

Gusheng Wu

¹

,

Kaiqi Chen

^1,*,

Min Deng

¹,

Meihua Chen

¹ and

Rui Xu

²

¹

School of Geosciences and Info-Physics, Central South University, Changsha 410083, China

²

Guizhou Survey & Design Research Institute for Water Resources and Hydropower, Guiyang 550002, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10704; https://doi.org/10.3390/app142210704

Submission received: 11 October 2024 / Revised: 8 November 2024 / Accepted: 12 November 2024 / Published: 19 November 2024

(This article belongs to the Special Issue Crowd-Sourced Data and Deep Learning in Remote Sensing: Methods and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Geospatial knowledge graphs provide critical technology for integrating geographic information and semantic knowledge, which are very useful for geographic data analysis. As the scale of geospatial knowledge graphs continues to grow, the distributed management of geospatial knowledge graphs is becoming an inevitable requirement. Geospatial knowledge graph partitioning is the core technology for the distributed management of geospatial knowledge graphs. To support geographic data analysis, spatial relationships between entities should be considered in the application of geospatial knowledge graphs. However, existing knowledge graph partitioning methods overlook the spatial relationships between entities, resulting in the low efficiency of spatial queries. To address this issue, this study proposes a geospatial knowledge graph partitioning method based on dual clustering which performs two different clustering methods step by step. First, the density peak clustering method (DPC) is used to cluster geographic nodes. The nodes within each cluster are merged into a super-node. Then, we use an efficient graph clustering method (i.e., Leiden) to identify the community structure of the graph. Nodes belonging to the same community are further merged to reduce the size of the graph. Finally, partitioning operations are performed on the compressed graph based on the idea of the Linear-Weighted Deterministic Greedy Policy (LDG). We construct a geospatial knowledge graph based on YAGO3 to evaluate the performance of the proposed graph partitioning method. The experimental results show that the proposed method outperforms ten comparison methods in terms of graph partitioning quality and spatial query efficiency.

Keywords:

graph partitioning; geospatial knowledge graph; spatial clustering; community detection; spatial relationship

1. Introduction

Geospatial knowledge graphs adopt structured graphs to effectively organize geographic knowledge by describing geographic concepts, geographic entities, and their inter-relationships [1]. The knowledge system represented by geospatial knowledge graphs is the key to transitioning from geographic data services to geographic knowledge services. Geospatial knowledge graphs are very helpful for geographic knowledge searches, intelligent geographic knowledge Q&As, geographic knowledge recommendations, and geographic data analysis [2]. In the era of big data, the scale of geospatial knowledge graphs is growing exponentially. For example, the LinkedGeodata dataset [3], first released in 2009, has more than 3 billion entities and 20 billion triples in its latest version. The GeoWordNet dataset [4], released in 2011, has over 3.6 million entities and 50 million triples in its latest version. The WorldKG dataset [2] released in 2021, contains more than 100 million entities and 800 million triples in its latest version. For large-scale geospatial knowledge graphs, standalone management methods can no longer meet current demands [5]. Distributed management is becoming an inevitable requirement. The partitioning of geospatial knowledge graphs is an essential task in large-scale knowledge graph management, which partitions a data graph into several subgraphs which are stored on different distributed cluster nodes [6]. Geospatial knowledge graph partitioning provides the foundational support for the distributed storage, management, querying, and reasoning of geospatial knowledge graphs. When executing a distributed query, the query is first decomposed into several subqueries based on the partitioning result. Then, the subqueries are assigned to various cluster nodes to obtain partial solutions. Finally, the results of the subqueries are connected to produce the final solution. The efficiency of query is highly relevant to the partitioning results. The more partitions passed during the query, the slower the query is.

Currently, knowledge graph partitioning is generally regarded as a graph partitioning problem. The optimization goals of node/edge partitioning strategies are (i) minimizing the number of cross-partition edges (cut edges) or nodes that have been duplicated in different partitions (replicated node) and (ii) ensuring balanced node or edge counts within each partition [7]. Minimizing the number of cut edges or replicated vertices aims to reduce network communication costs between cluster nodes in a distributed environment. Meanwhile, balancing the number of nodes or edges within each partition prevents the formation of computational hotspots and avoids excessive differences in processing capabilities. The graph partitioning problem is a well-known NP-hard problem [8,9], and no polynomial-time algorithm has been found to solve it thus far. Therefore, existing work has focused on finding approximate partitioning solutions rather than optimal ones [10,11]. Currently, according to different graph loading methods, existing graph partitioning algorithms can be categorized into streaming partitioning algorithms and offline partitioning algorithms.

Streaming graph partitioning algorithms load edges or nodes into a data stream according to certain rules (such as random-order, depth-first, or breadth-first traversal) and partition only one node or one edge at a time [12]. The hash partitioning method is the simplest streaming graph partitioning technique. It calculates a hash value based on the reading order or globally unique ID of vertices or edges and assigns vertices or edges with equal hash values to the same partition. Hash-based graph partitioning is nearly a random partitioning method, which leads to significant communication costs [13]. Xie et al. (2014) proposed the DBH (Degree-Based Hashing) algorithm, which improves upon the hash algorithm by prioritizing the partitioning of higher-degree vertices to enhance partitioning quality [14]. Degree-Based Hashing (DBH) is a hashing method that takes edges as input. For the nodes of an edge, it leverages the degree information of two nodes. Then, DBH replicates the node with a higher degree and assigns the edge according to the hash value of the node with a lower degree. However, the partitioning algorithms mentioned above do not consider historical partitioning information and have a certain level of randomness, resulting in a large number of cut edges or replicated vertices. This disadvantage leads to substantial communication costs. To solve this problem, many scholars have taken historical partitioning information into account. These algorithms establish a scoring function between the current data and each previously partitioned subgraph. When a new edge or vertex is loaded into the data stream, the partition is assigned based on the scoring function (e.g., Greedy algorithm [15], FENNEL (streaming edge-cut neural network for efficient graph partitioning) algorithm [16], HDRF (High-Degree Replicated First) algorithm [17] and Skewness-aware Vertex-cut Partitioner [18]). For instance, HDRF is an improved method of DBH. Instead of using the degrees of the nodes, HDRF constructs a score function based on the number of subgraph neighbors in the historical partition and computational load. The input edge will be assigned to the partition with the highest score. However, the above traditional streaming partitioning algorithms neglect the global structure of the graph and may fall into local optima. Researchers have introduced sliding window or buffer techniques to alleviate these limitations. These methods first load the data into a window or buffer to increase the number of candidate nodes/edges in one assignment, and then choose the best node/edge from the candidates for assignment to a partition [19,20,21,22,23]. Streaming graph partitioning algorithms are usually efficient, typically requiring only a single pass over the graph data to complete the partitioning. However, it is hard to capture the global structures of graphs with streaming partitioning algorithms, which usually lead to low-quality partitions [21].

Offline partitioning algorithms load the complete graph data into memory before partitioning. These algorithms can be further categorized into global and local search partitioning algorithms. Global search partitioning algorithms directly consider the global information of the entire graph to produce a partition. For instance, spectral partitioning methods calculate the eigenvector corresponding to the second-smallest eigenvalue of the graph’s Laplacian matrix and recursively bisect the graph until the partitioning goal is achieved [24]. The time complexity of spectral partitioning is too high to partition large-scale graphs. Multilevel partitioning algorithms [25,26,27,28,29,30,31] were further proposed to address this problem. Multilevel partitioning algorithms first coarsen a large graph into a smaller graph. Then, they apply a partitioning strategy (such as spectral partitioning [32], KL algorithm [33], or Greedy Graph Growth algorithm [34]) to the small graph. Finally, the partition is projected back onto the original large graph with further refinement. Local search partitioning algorithms optimize the initial partitioning based on local graph information. These algorithms can be classified into three categories: unit or group movement, label propagation partitioning algorithms, and neighborhood expansion-based partitioning algorithms. Algorithms based on unit or group movement aim to reduce edge cuts by moving units or groups between partitions to optimize the initial partitioning (e.g., KL algorithm [26], FM algorithm [35], and JA-BE-JA-VC algorithm [36]). However, the act of exchanging individual vertices or edges between partitions is prone to getting stuck in local optima [37,38,39,40,41]. To solve this problem, vertex group movement methods have been further developed [37,38]. These methods first detect vertex groups at partition boundaries, and then move these vertex groups based on movement gain and partition load to improve partition quality [40,41]. Label propagation-based partitioning algorithms randomly assign each node a label, and then iteratively update the labels under the constraint of balancing computational load between partitions [42,43]. The label of a node is updated to the most frequent label among its adjacent vertices. This update process continues until there are no label changes and loads of partitions are balanced [44,45,46]. Neighborhood expansion-based partitioning algorithms [27,47,48] first select seed points to form a core set, and then iteratively select appropriate vertices from the boundary set to join the core set. The boundary set is iteratively updated until the threshold of computational load is reached.

Based on the above reviews, one can find that the aforementioned methods only rely on the graph structure for partitioning and neglect the spatial relationship between entities. In the application of geospatial knowledge graphs (e.g., geographic knowledge recommendation and geographic data analysis), a spatial query is frequently required [49,50]. In Figure 1, we use a practical example to show the disadvantage of existing non-spatial graph partitioning methods. Different partitions are labeled by the color of nodes (orange and purple). Spatial queries are represented by orange and blue shadows. When a spatial relationship is neglected (Figure 1a), the two spatial queries need to run across partitions. The network communication between partitions will result in the increase in query response time. In contrast, when a spatial relationship is considered to produce partitions (Figure 1b), the two spatial queries can run in a single partition. To improve the efficiency of a spatial query in large-scale geospatial knowledge graphs, spatial relationships between entities must be considered in the partitioning of geospatial knowledge graphs [50,51]. Existing partitioning methods that disregard spatial relationships may lead to spatially adjacent entities being assigned to different partitions, negatively affecting spatial query performance. To address this issue, we treat the partitioning of geospatial knowledge graphs as a special spatial clustering problem and develop a dual clustering-based method for geospatial knowledge graph partitioning (GKGP-DC). GKGP-DC incorporates the constraints of spatial relationships between entity nodes into the optimization objective of graph partitioning.

2. Methods

GKGP-DC first identifies the geographic entities from a geospatial knowledge graph, and then extracts coordinates from these entities. Then, a spatial clustering method named density peak clustering is applied to obtained spatial clusters. Next, nodes within each spatial cluster are merged into super-nodes to ensure that spatially adjacent nodes can be assigned to the same partition. Subsequently, the graph clustering method named Leiden is used, further consolidating the aggregated graph into a weighted graph. The weighted graph not only accounts for geospatial correlation but also reflects the connectivity between entities in the graph. Finally, the partitions are produced by performing LDG on the weighted graph to balance the computational load of each partition. The overall process of the proposed method is shown in Figure 2.

2.1. Geographic Node Partitioning Based on Density Peak Clustering

To improve the efficiency of a spatial query, it is crucial to assign spatially adjacent nodes to the same partition. Therefore, GKGP-DC uses spatial clustering to pre-analyze the spatial structure of data. To find the natural spatial clusters from geographic nodes, we use a well-established spatial clustering algorithm, i.e., density peak clustering (DPC) [52], to cluster geographic nodes. DPC can naturally identify the clustering structure and automatically determine the number of clusters. The DPC algorithm is based on two fundamental observations: (i) the local density of the cluster center is higher than that of the surrounding samples; (ii) the distance between cluster centers is relatively large. DPC determines the cluster centers by identifying the density peaks in the dataset, and then defines the cluster boundaries based on density and relative distance. Inspired by DPC-KNN [53] and an adaptive cluster center selection method proposed by Sun et al. [54], we partition the geographical nodes in the following two steps:

2.1.1. Definition of Local Density and Relative Distance

As mentioned above, local density and relative distance are the basis of choosing cluster centers and determining the cluster boundaries. In this study, we used the definition of local density proposed in DPC-KNN [53]. The local density of a node is defined as the exponent of the mean square distance between the node itself and its k-nearest neighbors, as follows:

ρ_{i} = \exp (- \frac{1}{k} \sum_{j \in K N N_{i}} d {(i, j)}^{2})

(1)

where

ρ_{i}

represents the local density of node

i

,

k

denotes the number of nearest neighbors,

K N N_{i}

represents the set of the

k

nearest nodes to node

i

, and

d (i, j)

represents the Euclidean distance between node

i

and node

j

. Calculating the local density based on the k-nearest neighbors can assess the local structure of nodes and is helpful for identifying the clusters with different densities [53]. The relative distance is defined as the minimum distance between a node and the set of nodes with higher local density. For the node with the highest local density, the relative distance is defined as the maximum possible distance between two nodes (Equation (2)).

δ_{i} = \{\begin{matrix} m i n_{j : ρ_{i} > ρ_{j}} (d (i, j)) ρ_{i} < \max (ρ) \\ m a x_{j \neq i} (d (i, j)) ρ_{i} = \max (ρ) \end{matrix}

(2)

where

δ_{i}

represents the relative distance of node

i

, and

\max (ρ)

represents the maximum local density.

2.1.2. Cluster Center Selection and Node Agglomeration

To determine the cluster centers, the DPC algorithm typically involves manually selecting the cluster centers according to a decision diagram, as shown in Figure 3. However, this manual selection reduces the efficiency of DPC and introduces subjectivity into the results. Therefore, the strategy proposed by Sun et al. [54] is adopted for determining the cluster centers. First, nodes where both the local density

ρ

and the relative distance

δ

are greater than the remaining 80% are selected as candidate nodes. Then, the candidate nodes are ranked in descending order based on the product value of

δ

and

ρ

(

γ = δ \times ρ

). A linear fit line is drawn according to the

γ

values, and the fitted value

γ ’

is calculated. The difference between

γ

and

γ ’

is obtained for each node (

Δ γ = γ - γ ’

). The nodes with

Δ γ

greater than the average

Δ γ

are treated as candidates. Finally, cluster centers are consecutively selected from the front to the back of the ranked candidate sequence.

After the centers of clusters are determined, non-cluster center nodes are directly assigned to the cluster of the nearest node with a higher density than themselves. To ensure the spatial correlation of geographic nodes, the nodes within each cluster are merged into a super-node. The weight of the super-node is set as the sum of the weights of the nodes contained within the cluster. The edges between a cluster are merged into a single edge, with the edge weight set as the sum of the weights of the edges. For instance, nodes A, B, C, and D are merged into the orange super-node, and node I, J, and K are merged into the purple super-node (Figure 4). The weights of the orange super-node and purple super-node are 4 and 3, respectively. For example, the weight of the edge between the orange super-node and node G is set as the sum of the weights of BG and DG.

2.2. Discovery of Graph Structure Based on Graph Clustering

To avoid tightly connected nodes being assigned to different partitions, we further used a graph clustering method to identify the graph structure from the coarse graph obtained by Section 2.1, which can find the closely connected groups of nodes. Based on the graph structure, the coarse graph can be further comprised. Constructing partitions according to closely connected groups will be helpful for minimizing the number of cut edges or duplicated nodes. In this study, we use the Leiden algorithm [55] on the coarse graph to comprise the clustering result obtained by DPC. The Leiden algorithm was introduced by Traag et al. [55] as an improvement over the Louvain algorithm [56]. The Leiden algorithm is one of the most advanced community detection algorithms in recent years and has been shown to achieve state-of-the-art performance on many graphs [57].

The Leiden algorithm requires a pre-defined function to evaluate the quality of communities. GKGP-DC adopts modularity as the quality function for evaluating communities. Given a weighted graph

G

and a partition

C = {C_{1} C_{2}, \dots C_{n}}

, the modularity is calculated as follows:

Q = \frac{1}{2 m} \sum_{i, j} [w_{i j} - \frac{K_{i} K_{j}}{2 m}] δ (C_{i}, C_{j})

(3)

Based on the definition of quality function of communities and the graph aggregated from the DPC results in Section 2.1, the Leiden algorithm is applied to detect community structures with tight internal connections and loose external connections. The graph is further aggregated based on these community structures. The detailed steps of the algorithm are as follows:

(i) Local Movement of Nodes. Initially, each node is considered as a separate community (Figure 5a). Then, each node is moved to the community that yields the largest modularity gain (Figure 5b). In this phase, the Leiden algorithm uses a fast local movement strategy, only accessing nodes whose neighbors have moved to other partitions. This strategy enables the algorithm to perform local moves quickly and efficiently. The modularity gain from moving node

i

from community

C_{j}

to community

C_{s}

(where j ≠ k) can be calculated as follows:

△ Q_{i} = \frac{1}{2 m} [K_{i, s} - K_{i, j} + \frac{K_{i}}{m} (\sum j, t o t - \sum s, t o t - K_{i})]

(4)

where

K_{i, s}

and

K_{i, j}

represent the total weight of edges between node

i

and nodes in communities

C_{s}

and

C_{j}

, respectively.

\sum j, t o t

and

\sum s, t o t

is the total weight of edges in

C_{s}

and

C_{j}

, respectively.

m

denotes the total weight of all edges.

(ii) Refinement of Partitions. After local movement, each node within a partition is treated as a separate community, and community merging is performed within each partition (Figure 5c). This merging does not necessarily maximize modularity gain but allows the random selection of a community that increases modularity. The larger the modularity increase, the higher the likelihood of selecting that community for merging. The randomness in selecting communities allows a broader exploration of the partitioning space and also accelerates the algorithm.

(iii) Graph Aggregation Based on Refined Partitions. The graph is aggregated based on the refined partitioning results obtained by step (ii). Nodes in each partition are merged into a single node, and the corresponding edges are merged into a single edge (Figure 5d). The node weights are updated as the sum of the weights of the nodes in the partition, and the edge weights are set to the sum of the weights of the parallel edges. Then, step (i) and step (ii) are repeated on the aggregated graph (Figure 5e). The iteration will not stop until modularity can no longer be improved, and the graph will be the final result (Figure 5f).

2.3. Super-Node Partitioning Based on the LDG Algorithm

The geospatial knowledge graph is compressed into a smaller-scale weighted graph after the process of Section 2.1 and Section 2.2. To achieve the optimization goal of load balancing in graph partitioning, the LDG algorithm [12] is applied to partition the weighted graph (Figure 6). The main idea is to construct a scoring function based on the feature of vertices, their neighboring nodes, and the existing partition. Then, the nodes are assigned to the partition with the highest score under the load balancing constraint. The scoring function for a node is defined from two aspects: one is the sum of the edge weights between the node and its neighbors already assigned to the partition, and the other is the capacity constraint term. The calculation formulas are as follows:

g (v, P_{i}) = \sum_{u \in N (v), u \in P_{i}} w (v, u) \times (1 - \frac{|P_{i}|}{C})

(5)

C = β \times \frac{\sum_{x \in V} w (x)}{k}

(6)

where

N (v)

represents the set of neighbors of node

v

, and

w (v, u)

represents the weight between nodes

v

and

u

.

P_{i}

represents the partition, and

C

represents the capacity constraint for the partition.

β

(

β > 1

) is a user-defined parameter.

V

represents the total set of nodes in the graph, and

k

represents the number of partitions.

Since the smaller-scale weighted graph may contain some nodes with larger weights (Figure 6a), the optimization goal of load balancing needs extra attention. In our method, the k nodes with the largest weights are first selected and assigned to different partitions. Then, the remaining nodes are loaded based on a breadth-first search strategy and assigned based on the aforementioned scoring function. The specific implementation process (Figure 6) is shown as follows:

(i) Pre-allocation. As shown in the lower part of Figure 6b, nodes with the largest weight are first selected for each partition to prevent significant load imbalances in later partitioning stages.

(ii) Node Assignment Order. A node is randomly selected from the remaining nodes, and the node partitioning order is determined using the breadth-first search strategy.

(iii) Data Allocation. One node is loaded each time. If the sum of the weight of the current node and the total weight of the current partition does not exceed the capacity constraint

C

defined by Equation (6), the partition is considered a candidate partition. Then, the scoring function described by Equation (5) is used to calculate a score for the node and each candidate partition. The node is assigned to the partition with the highest score (Figure 6b). If multiple partitions have the same maximum score, the partition with the smallest weight is selected. The final result of partitioning is obtained after all nodes are assigned to corresponding partitions (Figure 6c).

3. Experiments

GKGP-DC was compared with ten representative graph partitioning algorithms: Hash, LDG, FENNEL, WStream, HeiStream, IHCGP, METIS, ParMetis, KAFFPA, and XtraPuLP. The experiments were executed on an Ubuntu cluster which was set up using VMWare virtual machines, consisting of one master node and eight worker nodes. Each virtual machine is allocated two processors, 4 GB of memory, and a 50 GB hard drive.

3.1. Dataset

In this study, a portion of the YAGO3 [58] dataset was selected to construct the geospatial knowledge graph, referred to as GeoKG. YAGO3 integrates information from Wikipedia with other data sources such as WordNet and Geonames, and it describes various entities such as people, organizations, cities, and the relationships between them. There are two reasons that YAGO3 is chosen to construct GeoKG. First, YAGO3 is a large-scale knowledge graph, which can highlight the difference in query response time between different partitioning methods. Second, YAGO3 has enough geospatial information to support a spatial–graph coupled query. The process of constructing GeoKG is as follows: The file named “yagoFacts” describes the relationships between all entities in the YAGO3 dataset. Therefore, all data (5,628,166 triples and 37 types of edges) from “yagoFacts” were selected. Then, the geospatial information of entities, such as location, was extracted from the “yagoGeonamesOnlyData” file. The geospatial information was connected with the entities in “yagoFacts”. Additionally, some random entities and their location information were selected from the remaining data in “yagoGeonamesOnlyData”. Finally, the aforementioned triples were merged to build GeoKG, which served as the geospatial knowledge graph for testing (Figure 7). GeoKG describes entities such as people, organizations, events, and places, along with the relationships between them. Entities with spatial location information are defined as geographic entity nodes. As shown in Table 1, GeoKG is a large-scale geographic spatial knowledge graph which has 3312247 nodes and 7022211 edges. There are approximately 12% (384981) geographic nodes in GeoKG, which are distributed globally (Figure 8). GeoKG contains 65 types of edges, with 23 types of edges related to geographic nodes.

3.2. Query Design

To validate the effectiveness of GKGP-DC, four types of spatial queries were designed for testing: withinDistance queries, buffer queries, k nearest neighbor (kNN) queries, and intersects queries. These four types of spatial queries are widely used in the field of geospatial data processing and are the basis of complex spatial queries [45]. Therefore, we believe that these four types of queries can comprehensively evaluate the performance of the proposed algorithm in practical applications. In addition, coupled queries that combine spatial and graph queries are an inevitable requirement in geospatial knowledge graph query scenarios. Therefore, the four spatial queries mentioned above were combined with four types of graph queries (Figure 9) to construct various spatial–graph coupled queries.

As shown in Figure 9a, the star queries were designed for obtaining the first neighbors of the node. Snowflake queries consist of multiple star queries. Figure 9b shows a snowflake query consisting of two star queries. The chain queries are step-by-step queries, which obtain only one neighbor according to the label of edge at each step (Figure 9c). The complex queries are composed of multiple basic queries. For example, Figure 9d shows a complex query consisting of two star queries and a chain query. In the experiments of spatial–graph coupled query, the spatial query was first applied to the geospatial knowledge graph. Based on the results of the spatial query, the graph query was further used to obtain the final results.

3.3. Results

3.3.1. Spatial Query Results

For withinDistance queries, buffer queries, kNN queries, and intersect queries, 20 geographic nodes were randomly selected as query inputs. The average number of computational nodes required to execute these queries (rounded up) was calculated for each partitioning algorithm. As shown in Table 2 and Figure 10, GKGP-DC required fewer computational nodes for all types of queries compared to other methods. Specifically, for withinDistance queries and buffer queries, the proposed method only needed three and two computational nodes, respectively. In contrast, algorithms such as Hash, LDG, FENNEL, WStream, and HeiStream required seven to eight computational nodes for most queries. GKGP-DC requires significantly fewer nodes than other algorithms, indicating a lower resource demand. Offline partitioning algorithms like METIS, ParMetis, KAFFPA, and XtraPuLP showed relatively lower computational node requirements for certain queries. For example, ParMetis needed only five computational nodes for kNN queries, which can be attributed to the fact that offline partitioning algorithms focus on minimizing edge cuts. As a result, proposed method can group nodes that are spatially close and tightly connected by edges into the same partition, reducing the number of computational nodes required for some queries.

The above experimental results can be explained as follows. Traditional graph partitioning methods do not consider spatial correlation; therefore, the spatial neighboring nodes are randomly distributed in different partitions. As a result, traditional graph partitioning methods require more computational nodes when spatial queries are executed. In contrast, the proposed method in this study takes spatial correlation into account, prioritizing the partitioning of geographically close nodes into the same partition. As a result, the number of computational nodes required for spatial queries is significantly reduced, which decreases dependency on computational resources, improves query response times, and enhances overall system performance.

3.3.2. Spatial–Graph Coupled Query Experimental Results

For withinDistance–graph coupled queries, GKGP-DC exhibits significant performance improvements (Table 3, Figure 11). For a withinDistance–star query, the proposed algorithm reduces the query response time by at least 0.868 s and up to 2.102 s. For a withinDistance–snowflake query, the reduction is at least 4.786 s and up to 15.325 s. The proposed method has a significant improvement on a withinDistance–chain query and a withinDistance–complex query, whose response time is reduced by at least 9.098 s and up to 23.053 s and at least 10.429 s and up to 22.260 s, respectively.

For buffer–graph coupled queries, the proposed algorithm also performs better compared to other partitioning methods (Table 4, Figure 12). Specifically, for a buffer–star query, the proposed algorithm reduces the query response time by at least 1.797 s and up to 2.487 s. For a buffer–snowflake query, the reduction is at least 4.907 s and up to 19.571 s. As for buffer–chain queries and buffer–complex queries, the response time is reduced by at least 7.162 s and up to 19.506 s and at least 10.756 s and up to 22.774 s, respectively.

The result of a kNN–graph coupled query also shows that GKGP-DC outperforms the comparison methods (Table 5, Figure 13). For example, the proposed algorithm reduces the query response time by at least 0.586 s and up to 0.886 s on a kNN–star query. For a kNN–snowflake query, the reduction is at least 4.992 s and up to 12.360 s. For a kNN–chain query, the response time is reduced by at least 5.172 s and up to 13.258 s. And for a kNN–complex query, the response time is reduced by at least 4.248 s and up to 16.845 s.

In addition, Table 6 and Figure 14 show that GKGP-DC exhibits significant performance improvements when handling different intersect–graph coupled queries. Specifically, for an intersect–star query, the proposed algorithm reduces the query response time by at least 0.472 s and up to 3.078 s. For an intersect–snowflake query, the reduction is at least 2.613 s and up to 10.772 s. For an intersect–chain query, the response time is reduced by at least 2.455 s and up to 11.819 s. And for an intersect–complex query, the response time is reduced by at least 4.039 s and up to 19.744 s.

3.4. Discussion

The experiment results (Table 2 and Figure 10) show that GKGP-DC outperforms the comparison methods on spatial-related queries. The number of computational nodes traversed is at least three less for withinDistance and buffer queries and at least three less for kNN queries and intersect queries. In addition, the improvement is also speciously obvious in spatial–complex coupled queries. In all experimental results of spatial–graph coupled queries, the response time is reduced by approximately 76% at most (result of intersect–complex queries, Table 6 and Figure 14). The main reasons for this can be summarized as follows:

(i) GKGP-DC preprocesses geographic nodes using density peak clustering, effectively grouping spatially proximate nodes together. This strategy significantly optimizes the execution efficiency of spatial queries. As for spatial–graph coupled queries, the results of spatial subqueries are confined to only a few partitions. This not only significantly reduces the assembly time of the sub-results, but also decreases the number of results generated by the factorized queries. Furthermore, the network transmission consumption is reduced. Overall, the partitions produced by GKGP-DC effectively reduce the total query response time for spatial-related queries.

(ii) The preprocessing of geographic nodes may result in inferior partition quality compared to the results obtained by high-quality offline partitioning algorithms (such as Metis). However, GKGP-DC mitigates this issue by using the Leiden algorithm to identify graph structures. The graph clusters detected by Leiden are treated as the basic units for graph partitioning, which can preserve the structural characteristics of the graph. It is helpful for reducing the number of cross-partition edges. Therefore, the weighted graph produced after applying DPC and Leiden contains information on both spatial relationships and graph structures. The efficiency of both spatial queries and graph queries can benefit from this partitioning scheme.

The significant reductions in query response time and computational node usage show that GKGP-DC can optimize the performance of spatial queries and spatial–graph coupled queries. For a system requiring the distributed management of large-scale geospatial knowledge graphs, such as a geographic intelligent question-answering platform, GKGP-DC can improve system operation efficiency.

4. Conclusions

Geospatial knowledge graphs contain rich geographic location semantics, requiring graph partitioning algorithms to comprehensively consider both spatial relationships and graph structures. Traditional graph partitioning methods discard the spatial relationship between entities, resulting in poor performance for spatial queries. To solve this problem, we propose GKGP-DC to produce spatially aware partitions. By using density peak clustering to aggregate spatial entities to super-nodes, GKGP-DC avoids dividing close spatial entities into different partitions. Then, the Leiden algorithm is applied on the coarse graph, producing clusters which meet the requirement of minimizing cut edges. Finally, the LDG algorithm is used to refine the partitioning result, ensuring load balance without altering the clusters. Therefore, the final partitioning result can be obtained under three constraints: spatial relationship, graph structure, and load balance. Experiments on a geospatial knowledge graph (i.e., GeoKG) show that compared to ten representative methods, GKGP-DC can effectively reduce the number of computational node resources required for spatial queries and obviously improve the response time for spatial–graph coupled queries. GKGP-DC can be further used for the distributed management of large-scale geospatial knowledge graphs.

Future studies should focus on these two aspects. First, GKGP-DC lacks the ability of processing spatial relationships between entities with different geometric types, such as points and polygons. GKGP-DC treats all geographic entities as points. The centroids of entities were used to represent the locations of entities. It is possible that a point entity is in a polygon entity, but the distance between the point and the centroid of the polygon is large. In this situation, the point may be assigned to a different partition, resulting in cross-partition communication and the low efficiency of spatial queries. When searching the k-nearest neighbors for DPC, taking the spatial relationship into account may address this issue. For example, if a polygon entity contains a point entity, they should always be considered as neighbors. In addition, GKGP-DC can only partition geospatial knowledge graphs based on the spatial relationship and graph structure. In practice, user preference also should be considered to partition the graph. One way to alleviate this problem is to adjust the partition based on the history records of users. We leave this issue for further research.

Author Contributions

Conceptualization, Y.C. and K.C.; data curation, F.O., Y.C. and M.C.; methodology, Y.C., K.C. and Q.L.; visualization, Y.C. and F.O.; writing—original draft, Y.C., K.C. and G.W.; writing—review and editing, Q.L., M.D. and R.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded through support from the National Key Research and Development Program of China (No. 2021YFB3900903), the National Natural Science Foundation of China (No. 42271484), the water conservancy science and technology project of Guizhou, China (No. KT202410), and the Research Foundation of the Department of Natural Resources of Hunan Province (Grant No. HBZ20240132).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author(s).

Acknowledgments

We deeply appreciate the anonymous reviewers for their constructive and helpful suggestions.

Conflicts of Interest

Rui Xu was employed by the company Guizhou Survey & Design Research Institute for Water Resources and Hydropower. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

Varanka, D.E. A geospatial knowledge graph prototype for national topographic mapping. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 48, 511–516. [Google Scholar] [CrossRef]
Dsouza, A.; Tempelmeier, N.; Yu, R.; Gottschalk, S.; Demidova, E. Worldkg: A world-scale geographic knowledge graph. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Queensland, Australia, 1–5 November 2021; pp. 4475–4484. [Google Scholar]
Auer, S.; Lehmann, J.; Hellmann, S. Linkedgeodata: Adding a spatial dimension to the web of data. In Proceedings of the Semantic Web-ISWC 2009: 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, 25–29 October 2009; Proceedings 8. pp. 731–746. [Google Scholar]
Giunchiglia, F.; Maltese, V.; Farazi, F.; Dutta, B. GeoWordNet: A resource for geo-spatial applications. In Proceedings of the Semantic Web: Research and Applications: 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Greece, 30 May–3 June 2010; Proceedings, Part I 7. pp. 121–136. [Google Scholar]
Qi, Z.; Wang, H.; Zhang, H. A dual-store structure for knowledge graphs. IEEE Trans. Knowl. Data Eng. 2023, 35, 1104–1108. [Google Scholar] [CrossRef]
Heidari, S.; Simmhan, Y.; Calheiros, R.N.; Buyya, R. Scalable graph processing frameworks: A taxonomy and open challenges. ACM Comput. Surv. (CSUR) 2018, 51, 1–53. [Google Scholar] [CrossRef]
Rahimian, F.; Payberah, A.H.; Girdzijauskas, S.; Haridi, S. A distributed algorithm for large-scale graph partitioning. ACM Trans. Auton. Adapt. Syst. 2015, 10, 1–24. [Google Scholar] [CrossRef]
Garey, M.R.; Johnson, D.S.; Stockmeyer, L. Some simplified NP-complete problems. Theor. Comput. Sci. 1974, 1, 237–267. [Google Scholar] [CrossRef]
Ding, L.; Li, C.; Jin, D.; Ding, S. Survey of spectral clustering based on graph theory. Pattern Recognit. 2024, 151, 110366. [Google Scholar] [CrossRef]
Kolountzakis, M.N.; Miller, G.L.; Peng, R.; Tsourakakis, C.E. Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 2012, 8, 161–185. [Google Scholar] [CrossRef]
Camilus, K.S.; Govindan, V.K. A review on graph based segmentation. Int. J. Image Graph. Signal Process. 2012, 4, 1. [Google Scholar] [CrossRef]
Stanton, I.; Kliot, G. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 1222–1230. [Google Scholar]
Malewicz, G.; Austern, M.H.; Bik, A.J.; Dehnert, J.C.; Horn, I.; Leiser, N.; Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 135–146. [Google Scholar]
Xie, C.; Yan, L.; Li, W.J.; Zhang, Z. Distributed power-law graph computing: Theoretical and empirical analysis. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1673–1681. [Google Scholar]
Gonzalez, J.E.; Low, Y.; Gu, H.; Bickson, D.; Guestrin, C. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, Hollywood, CA, USA, 8–10 October 2012; pp. 17–30. [Google Scholar]
Tsourakakis, C.; Gkantsidis, C.; Radunovic, B.; Vojnovic, M. FENNEL: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, NY, USA, 24–28 February 2014; pp. 333–342. [Google Scholar]
Petroni, F.; Querzoni, L.; Daudjee, K.; Kamali, S.; Iacoboni, G. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 243–252. [Google Scholar]
Ding, Z.; Xiang, Y.; Wang, S.; Xie, X.; Zhou, S.K. Play like a vertex: A stackelberg game approach for streaming graph partitioning. Proc. ACM Manag. Data 2024, 2, 1–27. [Google Scholar] [CrossRef]
Mayer, C.; Mayer, R.; Tariq, M.A.; Geppert, H.; Laich, L.; Rieger, L.; Rothermel, K. ADWISE: Adaptive window-based streaming edge partitioning for high-speed graph processing. In Proceedings of the 38th IEEE International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2–6 July 2018; pp. 685–695. [Google Scholar]
Patwary, M.A.K.; Garg, S.; Kang, B. Window-based streaming graph partitioning algorithm. In Proceedings of the Australasian Computer Science Week Multiconference, Sydney, NSW, Australia, 29–31 January 2019; pp. 1–10. [Google Scholar]
Li, Y.; Li, C.; Orgerie, A.C.; Parvédy, P.R. WSGP: A window-based streaming graph partitioning approach. In Proceedings of the 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Melbourne, Australia, 10–13 May 2021; pp. 586–595. [Google Scholar]
Faraj, M.F.; Schulz, C. Buffered streaming graph partitioning. ACM J. Exp. Algorithmics 2022, 27, 1–26. [Google Scholar] [CrossRef]
Wang, Z.; Yang, Z.; Wang, N.; Du, Y.; Nie, J.; Wei, Z.; Yu, G. Lightweight streaming graph partitioning by fully utilizing knowledge from local view. In Proceedings of the 43rd International Conference on Distributed Computing Systems, Hong Kong, China, 18–21 July 2023; pp. 614–625. [Google Scholar]
Fiedler, M. Algebraic connectivity of graphs. Czechoslov. Math. J. 1973, 23, 298–305. [Google Scholar] [CrossRef]
Karypis, G.; Kumar, V. Analysis of multilevel graph partitioning. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, San Diego, CA, USA, 8 December 1995; p. 29-es. [Google Scholar]
Karypis, G.; Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 1998, 20, 359–392. [Google Scholar] [CrossRef]
Holtgrewe, M.; Sanders, P.; Schulz, C. Engineering a scalable high-quality graph partitioner. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Atlanta, GA, USA, 19–23 April 2010; pp. 1–12. [Google Scholar]
Sanders, P.; Schulz, C. Engineering multilevel graph partitioning algorithms. In Proceedings of the 19th Annual European Symposium on Algorithms, Saarbrücken, Germany, 5–9 September 2011; pp. 469–480. [Google Scholar]
Sanders, P.; Schulz, C. Distributed evolutionary graph partitioning. In Proceedings of the 14th Workshop on Algorithm Engineering and Experiments, Kyoto, Japan, 16 January 2012; pp. 16–29. [Google Scholar]
Meyerhenke, H.; Sanders, P.; Schulz, C. Parallel graph partitioning for complex networks. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 2625–2638. [Google Scholar] [CrossRef]
Jafari, N.; Selvitopi, O.; Aykanat, C. Fast shared-memory streaming multilevel graph partitioning. J. Parallel Distrib. Comput. 2021, 147, 140–151. [Google Scholar] [CrossRef]
Chan, P.K.; Schlag, M.D.F.; Zien, J.Y. Spectral k-way ratio-cut partitioning and clustering. In Proceedings of the 30th International Design Automation Conference, Dallas, TX, USA, 14–18 June 1993; pp. 749–754. [Google Scholar]
Kernighan, B.W.; Lin, S. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 1970, 49, 291–307. [Google Scholar] [CrossRef]
Predari, M.; Esnard, A. A k-way greedy graph partitioning with initial fixed vertices for parallel applications. In Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Heraklion, Greece, 17–19 February 2016; pp. 280–287. [Google Scholar]
Fiduccia, C.M.; Mattheyses, R.M. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, Las Vegas, NV, USA, 14–16 June 1982; pp. 175–181. [Google Scholar]
Rahimian, F.; Payberah, A.H.; Girdzijauskas, S.; Haridi, S. Distributed vertex-cut partitioning. In Proceedings of the 14th IFIP International Conference on Distributed Applications and Interoperable System, Berlin, Germany, 3–6 June 2014; pp. 186–200. [Google Scholar]
Li, H.; Yuan, H.; Huang, J.; Cui, J.; Yoo, J. Dynamic graph repartitioning: From single vertex to vertex group. In Proceedings of the 25th International Conference on Database Systems for Advanced Applications, Jeju, Republic of Korea, 24–27 September 2020; pp. 482–497. [Google Scholar]
Li, H.; Liu, Y.; Yang, S.; Lin, Y.; Yang, Y.; Yoo, J. An improved hill climbing algorithm for graph partitioning. Comput. J. 2023, 66, 1176–1761. [Google Scholar] [CrossRef]
Mayer, C.; Tariq, M.A.; Mayer, R.; Rothermel, K. Graph: Traffic-aware graph processing. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 1289–1302. [Google Scholar] [CrossRef]
Li, H.; Yuan, H.; Huang, J.; Ma, X.; Cui, J.; Yoo, J. Edge repartitioning via structure-aware group migration. IEEE Trans. Comput. Soc. Syst. 2021, 9, 751–760. [Google Scholar] [CrossRef]
Li, H.; Yuan, H.; Huang, J.; Cui, J.; Ma, X.; Wang, S.; Philip, S.Y. Group reassignment for dynamic edge partitioning. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2477–2490. [Google Scholar] [CrossRef]
Ugander, J.; Backstrom, L. Balanced label propagation for partitioning massive graphs. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, Rome, Italy, 4–8 February 2013; pp. 507–516. [Google Scholar]
Vaquero, L.; Cuadrado, F.; Logothetis, D.; Martella, C. Adaptive partitioning for large-scale dynamic graphs. In Proceedings of the 4th Annual Symposium on Cloud Computing, Sydney, Australia, 3–5 December 2014; pp. 144–153. [Google Scholar]
Martella, C.; Logothetis, D.; Loukas, A.; Siganos, G. Spinner: Scalable graph partitioning in the cloud. In Proceedings of the 33rd IEEE International Conference on Data Engineering, San Diego, CA, USA, 19–22 April 2017; pp. 1083–1094. [Google Scholar]
Slota, G.M.; Rajamanickam, S.; Madduri, K. PuLP/XtraPuLP: Partitioning Tools for Extreme-Scale Graphs; Technical report; Sandia National Lab. (SNL-NM): Albuquerque, NM, USA, 2017. [Google Scholar]
Moussawi, A.E.; Seghouani, N.B.; Bugiotti, F. A graph partitioning algorithm for edge or vertex balance. In Proceedings of the 31st International Conference on Database and Expert Systems Applications, Bratislava, Slovakia, 14–17 September 2020; pp. 23–37. [Google Scholar]
Zhang, C.; Wei, F.; Liu, Q.; Tang, Z.G.; Li, Z. Graph edge partitioning via neighborhood heuristic. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 605–614. [Google Scholar]
Hanai, M.; Suzumura, T.; Tan, W.J.; Liu, E.; Theodoropoulos, G.; Cai, W. Distributed edge partitioning for trillion-edge graphs. VLDB Endow. 2019, 12, 2379–2392. [Google Scholar] [CrossRef]
Lee, K.; Ganti, R.K.; Srivatsa, M.; Liu, L. Efficient spatial query processing for big data. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, TX, USA, 4–7 November 2014; pp. 469–472. [Google Scholar]
García-García, F.; Corral, A.; Iribarne, L.; Vassilakopoulos, M.; Manolopoulos, Y. Efficient distance join query processing in distributed spatial data management systems. Inf. Sci. 2020, 512, 985–1008. [Google Scholar] [CrossRef]
Zhong, Y.; Han, J.; Zhang, T.; Li, Z.; Fang, J.; Chen, G. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Shanghai, China, 21–25 May 2012; pp. 2085–2094. [Google Scholar]
Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [PubMed]
Du, M.; Ding, S.; Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl. Based Syst. 2016, 99, 135–145. [Google Scholar] [CrossRef]
Sun, L.; Ye, T.; Sun, J.; Duan, X.; Luo, Y. Density-peak-based overlapping community detection algorithm. IEEE Trans. Comput. Soc. Syst. 2021, 9, 1211–1223. [Google Scholar] [CrossRef]
Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef]
Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, 10008. [Google Scholar] [CrossRef]
Anuar, S.H.H.; Abas, Z.A.; Yunos, N.M.; Zaki, N.H.M.; Hashim, N.A.; Mokhtar, M.F.; Nizam, A.F. Comparison between Louvain and Leiden algorithm for network structure: A review. J. Phys. Conf. Ser. 2021, 2129, 012028. [Google Scholar] [CrossRef]
Mahdisoltani, F.; Biega, J.; Suchanek, F.M. Yago3: A knowledge base from multilingual Wikipedias. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013. [Google Scholar]

Figure 1. Comparison of partitioning methods when ignoring and considering spatial information.

Figure 2. Flowchart of GKGP-DC.

Figure 3. Decision diagram of DPC.

Figure 4. Node agglomeration based on DPC.

Figure 5. Graph structure discovery.

Figure 6. Super-node partitioning based on LDG.

Figure 7. Part of GeoKG.

Figure 8. Distribution of geographic entities.

Figure 9. Four types of graph query.

Figure 10. The number of computational nodes for different queries (unit: count).

Figure 11. Response time of withinDistance–graph coupled query (unit: seconds).

Figure 12. Response time of buffer–graph coupled query (unit: seconds).

Figure 13. Response time of buffer–graph coupled query (unit: seconds).

Figure 14. Response time of intersect–graph coupled query (unit: seconds).

Table 1. Data description of GeoKG.

Dataset	Number of Nodes	Number of Edges	Number of Geographic Nodes	Number of Edge Labels
GeoKG	3,312,247	7,022,211	384,981	65

Table 2. The number of computational nodes required for different queries (unit: count).

Algorithm	WithinDistance Query	Buffer Query	kNN Query	Intersect Query
Hash	8	7	8	8
LDG	8	8	8	7
FENNEL	7	7	7	7
WStream	8	8	8	8
HeiSream	6	6	7	6
IHCGP	7	6	7	7
METIS	7	7	6	6
ParMetis	6	6	5	8
KAFFPA	6	5	6	6
XtraPuLP	7	6	5	7
GKGP-DC	3	2	3	4

Table 3. Response time of withinDistance–graph coupled query (unit: seconds).

Algorithm	withinDistance– Star Query	withinDistance– Snowflake Query	withinDistance– Chain Query	withinDistance– Complex Query
hash	5.739	26.248	36.128	38.154
LDG	5.821	23.245	35.759	38.047
FENNEL	4.587	24.056	34.802	37.095
wstream	5.407	22.108	33.252	37.165
HeiStream	5.424	21.236	33.285	37.043
IHCGP	5.705	19.094	32.908	37.094
METIS	5.043	16.246	22.571	26.981
parmetis	5.103	15.709	22.545	26.323
KaFFPa	5.152	15.745	22.173	26.584
xtraPuLP	4.648	15.879	22.556	26.427
GKGP-DC	3.719	10.923	13.075	15.894

Table 4. Response time of buffer–graph coupled query (unit: seconds).

Algorithm	Buffer–Star Query	Buffer–Snowflake Query	Buffer–Chain Query	Buffer–Complex Query
hash	4.045	28.699	32.367	35.879
LDG	4.630	24.158	29.410	33.543
Fennel	4.015	24.537	29.072	33.109
wstream	4.403	23.789	29.328	33.132
HeiStream	4.352	21.585	29.091	33.096
IHCGP	4.705	20.956	27.923	30.745
METIS	4.043	15.114	21.518	24.511
parmetis	4.233	15.393	21.531	24.462
KaFFPa	4.192	14.035	20.023	23.861
xtraPuLP	4.048	14.752	21.479	24.533
GKGP-DC	2.218	9.128	12.861	13.105

Table 5. Response time of kNN coupled query (unit: seconds).

Algorithm	kNN–Star Query	kNN–Snowflake Query	kNN–Chain Query	kNN–Complex Query
hash	2.142	22.421	21.31	25.994
LDG	2.287	22.204	20.156	25.754
FENNEL	2.079	21.312	20.015	24.695
wstream	2.260	22.184	19.986	23.624
HeiStream	1.987	21.661	19.084	23.345
IHCGP	2.076	19.273	18.608	20.309
METIS	2.054	15.94	14.584	13.397
parmetis	1.993	15.702	14.987	14.082
KaFFPa	2.121	15.474	13.224	13.599
xtraPuLP	2.018	15.053	14.431	13.48
GKGP-DC	1.401	10.061	8.052	9.149

Table 6. Response time of intersect–graph coupled query (unit: seconds).

Algorithm	Intersect–Star Query	Intersect–Snowflake Query	Intersect–Chain Query	Intersect–Complex Query
hash	4.202	16.235	16.146	25.992
LDG	4.564	16.209	15.11	23.415
FENNEL	4.697	15.776	15.837	23.639
wstream	4.339	15.753	15.401	22.177
HeiStream	3.987	15.067	14.945	21.046
IHCGP	3.745	13.268	14.515	21.768
METIS	2.093	8.388	7.573	10.639
parmetis	2.122	8.171	7.544	10.287
KaFFPa	2.091	8.076	6.782	10.334
xtraPuLP	2.146	8.562	6.886	10.639
GKGP-DC	1.619	5.463	4.327	6.248

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, Y.; Ou, F.; Liu, Q.; Wu, G.; Chen, K.; Deng, M.; Chen, M.; Xu, R. Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning. Appl. Sci. 2024, 14, 10704. https://doi.org/10.3390/app142210704

AMA Style

Chen Y, Ou F, Liu Q, Wu G, Chen K, Deng M, Chen M, Xu R. Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning. Applied Sciences. 2024; 14(22):10704. https://doi.org/10.3390/app142210704

Chicago/Turabian Style

Chen, Yuxuan, Feifei Ou, Qiliang Liu, Gusheng Wu, Kaiqi Chen, Min Deng, Meihua Chen, and Rui Xu. 2024. "Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning" Applied Sciences 14, no. 22: 10704. https://doi.org/10.3390/app142210704

APA Style

Chen, Y., Ou, F., Liu, Q., Wu, G., Chen, K., Deng, M., Chen, M., & Xu, R. (2024). Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning. Applied Sciences, 14(22), 10704. https://doi.org/10.3390/app142210704

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Dual Clustering-Based Method for Geospatial Knowledge Graph Partitioning

Abstract

1. Introduction

2. Methods

2.1. Geographic Node Partitioning Based on Density Peak Clustering

2.1.1. Definition of Local Density and Relative Distance

2.1.2. Cluster Center Selection and Node Agglomeration

2.2. Discovery of Graph Structure Based on Graph Clustering

2.3. Super-Node Partitioning Based on the LDG Algorithm

3. Experiments

3.1. Dataset

3.2. Query Design

3.3. Results

3.3.1. Spatial Query Results

3.3.2. Spatial–Graph Coupled Query Experimental Results

3.4. Discussion

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI