1. Introduction
The Knowledge Graph (KG), as a structured form of knowledge, plays a pivotal role in enabling semantic interoperability across multi-source heterogeneous data [1,2], and has demonstrated significant capabilities in various artificial intelligence applications [3,4,5]. In recent years, the geographic knowledge graph (GeoKG) has been proposed, which organizes, links, and infers geospatial knowledge, and serves various geographical artificial intelligence (GeoAI) applications, such as geographic spatiotemporal question answering systems [6], economic indicator prediction [7], weather prediction [8], traffic forecasting [9], human activity trajectory mining [10], point of interest (POI) recommendation [11], geographic entity retrieval [12], and urban functional area detection [13].
Previous GeoKGs can be categorized into three types according to the data sources used in their construction. A detailed comparison of these GeoKGs is provided in Table 1.
(1) GeoKGs based on general encyclopedias: These GeoKGs obtain geographic items from large-scale general-purpose internet encyclopedia data, such as YAGO [26], Wikidata [27], and Freebase [28]. They are rich in attribute information and provide common sense geographic knowledge. However, the geographic entities are sparsely distributed, with a lack of spatial relations between entities. Additionally, the coverage of geographic entity types and regions is limited, making it difficult to comprehensively represent geospatial semantics.
(2) GeoKGs extracted from geographic texts: These GeoKGs focus on specialized geographic concepts and the interactions between geographic features, such as GeoKG [23], GEKG [24], and AugGKG [25]. This type of GeoKG offers in-depth theoretical support and covers semantic relationships found in geographic texts, making it useful for research and applications in specific fields. However, due to limitations in data acquisition and coverage, these GeoKGs tend to have a small number of items and limited coverage of entity types and regions, making it challenging to meet broader geographic knowledge demands.
(3) GeoKGs based on OpenStreetMap (OSM): These GeoKGs rely on abundant open geographic information resources, such as LinkedGeoData [19], CrowdGeoKG [20], and WorldKG [22], which cover a wide range of geographic entities and attribute information. They excel in terms of geographic entity coverage and the richness of attribute information, yet they still fall short in terms of relationships between geographic entities and the representation of spatial semantics, lacking the comprehensive modeling of spatial relationships and hierarchical structures.
Overall, existing GeoKGs each cover certain aspects of geospatial semantic information, and each type has its own strengths in representing geographic knowledge. However, every type still suffers from at least one of the following: limited geographic entity coverage, insufficient attribute information, or a lack of spatial relationships. As a result, none of them comprehensively models key geographic semantics, which hinders the effective utilization and representation of the rich semantics and prominent patterns in geographic knowledge.
This paper proposes a hierarchical GeoKG (HGeoKG) that encompasses most types of geographic entities and relationships with rich attribute information. HGeoKG can be used to evaluate and advance geographic knowledge embedding techniques, and thus more effectively supports downstream GeoAI applications.
The contributions of this study are as follows:
1. This paper proposes the concept of HGeoKG, the first geographic knowledge graph that integrates rich attributes, spatial relationships, and regional hierarchical semantics, thereby providing a comprehensive representation of geographic knowledge.
2. This paper proposes a method for constructing HGeoKG and presents a dataset named HGeoKG-MHT-670K. Statistical analysis of this dataset reveals significant regional heterogeneity and long-tail distribution patterns, providing insight into the intrinsic structure and distribution characteristics of GeoKGs.
3. This paper conducts extensive knowledge graph reasoning experiments on HGeoKG-MHT-670K. The experimental results indicate that the regional heterogeneity of the dataset makes it difficult for Knowledge Graph Embedding (KGE) models to achieve consistent performance across all regions, highlighting the need for modeling strategies tailored to regional differences. Additionally, the geographic long-tail distribution pattern leads to a decline in embedding quality for low-popularity entities, underscoring the need to improve how models handle such data. This study provides empirical support for the further optimization and application of GeoKGs.
2. Related Work
In this study, a new classification framework for GeoKGs is presented, based on the geographic data sources they use: GeoKGs based on internet encyclopedias, GeoKGs extracted from geographic texts, and GeoKGs based on OSM.
2.1. Internet Encyclopedias-Based GeoKGs
With the development of large-scale general-purpose internet encyclopedic data, some studies have highlighted the rich geographic semantic information embedded in general encyclopedic data, such as geographic entities and spatial location information. These GeoKGs are derived from subsets of large general knowledge bases, including YAGO [26], Wikidata [27], and Freebase [28], which contain geographic knowledge. For the representation of geospatial data, DBpedia [29] offers latitude and longitude values for various geographic entities. YAGO2 [14], Clinga [15], and NCGKB [16] are knowledge bases with human geography knowledge derived from Wikipedia, Baidu Baike, and Chinese Wikipedia, respectively. Additionally, GeoKG [18] incorporates vector geographic datasets into Baidu Baike, adding precise coordinates and spatial relationships to the general GeoKG. YAGO2geo [17], based on YAGO2 and reference geospatial datasets such as Greek administrative geography (GAG), the global administrative areas database (GADM), and OSM, focuses primarily on administrative regions and reuses existing ontologies from the GAG dataset, leading to limited coverage of geographic entity types.
2.2. Geographic Text-Based GeoKGs
Geographic semantic information in geographic knowledge is complex and diverse. Specialized geographic texts encompass detailed semantic information regarding the interactions between geographic entities. Some studies, leveraging these semantic characteristics, have designed conceptual models of GeoKGs that theoretically represent geographic knowledge more effectively. Among these, GeoKG [23] is a formalized representation of geographic knowledge, extending the Attributive Language with Complements (ALC) description logic. It focuses on spatiotemporal knowledge, using entity states to represent changes in each geographic object. Zheng proposed a Geographic Evolution Knowledge Graph (GEKG), which is based on spatiotemporal processes and establishes a hierarchical cube model structure [24]. AugGKG [25], an augmented GeoKG, utilizes the GeoSOT global subdivision grid model and time-slice subgraph architecture to discretize and normalize spatiotemporal data within the knowledge graph. These models, through case studies based on knowledge graph queries, have demonstrated their capability to represent the spatiotemporal characteristics of geographic knowledge.
2.3. OpenStreetMap-Based GeoKGs
OSM is a rich source of open geographic information, encompassing a vast array of geographic entities. The representation of these entities (e.g., buildings, mountains, rivers) is characterized by high heterogeneity, diversity, and incompleteness. With the growth of large-scale open crowdsourced geographic data like OSM, some research has focused on utilizing the geographic information from OSM to construct GeoKGs.
Early studies developed ontologies suited to the structure of OSM data: OSMOnto [21] describes an ontology for OSM tags (e.g., (building, yes)), representing a class hierarchy extracted from OSM keys and values. OSM Semantic Network [30] contains RDF triples extracted from OSM tags available on the OSM Wiki website. Although OSMOnto and the OSM Semantic Network extracted a significant number of concepts, they did not include any geographic entity instances. Subsequently, LinkedGeoData [19] converted OSM data into an RDF knowledge graph. This is based on a formal ontology created using OSM tags and keys, offering simplified mappings between OSM data and classes and attributes from other data sources. CrowdGeoKG [20] extracted different types of entities from OSM and enriched them with human geographic knowledge from Wikidata. WorldKG [22], by analyzing a large set of heterogeneous OSM data tags, distilled a class hierarchy of OSM elements. After a degree of manual filtering, geographic entities in OSM were classified into a top-down hierarchical structure, covering various geographic categories and linking geographic entities to specific classes in Wikidata and the DBpedia ontology. However, WorldKG mainly utilizes point-type entities from OSM and their attributes to construct GeoKGs, lacking coverage of other geographic entity types such as line and polygon entities.
In summary, existing GeoKGs have limitations in several key areas. First, GeoKGs based on internet encyclopedias have deficiencies in geographic entity types and spatial coverage, typically including only common attributes and lacking in-depth descriptions of geographically specific attributes. Second, the conceptual models of GeoKGs extracted from geographic texts require precise and broad geographic data, which are often dispersed across specialized texts in the geographic domain. These texts contain fewer items, and extracting entities and relationships from them is difficult, which greatly limits the scalability and applicability of this type of GeoKG. Furthermore, some knowledge graphs have incomplete coverage of spatial types, typically supporting only one or two of the point, line, and polygon types. Additionally, most knowledge graphs do not explicitly model spatial relationships, with only a few providing basic adjacency relations and lacking support for complex spatial relationships, such as inclusion or intersection. Finally, many GeoKGs fail to effectively model hierarchical relationships between geographic entities (e.g., the hierarchical structure of administrative divisions), which limits their performance in applications requiring hierarchical reasoning.
3. HGeoKG
This section introduces the schema design and data construction methods of HGeoKG. First, Section 3.1 provides an overview of HGeoKG. Then, Section 3.2 discusses the ontology schema, regional hierarchical structure, and multi-granular relationships in HGeoKG. Finally, Section 3.3 details the specific construction process of HGeoKG’s data layer.
3.1. Overview
Figure 1 presents the overall framework of HGeoKG, comprising two core layers: Schema and Data, organized according to the workflow from data to knowledge graph construction. The Schema layer consists of three components: ontology design, spatial relationship hierarchical structure design, and regional hierarchical structure design. This layer defines the overall structure and organizational rules of the geographic knowledge graph, providing theoretical guidance and framework support for subsequent data processing. The Data layer illustrates the specific construction process of the geographic knowledge graph, with data sources including administrative boundary data, OSM polygon data, OSM line data, and OSM point data. Data processing is primarily divided into three steps: first, the extraction of geographic entities and attributes; second, the extraction of spatial relationships; and finally, the extraction of regional layering and partitioning. This series of steps ultimately achieves the data generation and construction of the geographic knowledge graph.
3.2. Schema Layer
This subsection conceptualizes and implements the schema layer to integrate geographic knowledge.
3.2.1. HGeoKG Ontology
The ontology of HGeoKG, as shown in Figure 2, illustrates the attribute information of geographic entities and their spatial relationships. In our ontology design, the attributes of geographic entities are categorized into two types: general attributes and heterogeneous attributes. General attributes include the spatial types of geographic entities and the common sense categories. The definitions of geographic entity categories in Table 2 are based on the official OSM documentation (https://download.geofabrik.de/osm-data-in-gis-formats-free.pdf (accessed on 1 January 2024)). Specifically, “Point”, “Line”, and “Polygon” represent point, line, and polygon geometric shapes, respectively. The subclasses within each category (such as roads, railways, buildings, etc.) are directly derived from the OSM classification system to ensure their broad applicability. Heterogeneous attributes provide unique descriptive features for each geographic entity, which are not shared by all entities. The types of spatial relationships in Table 3 are derived from the topological spatial relationship model known as the Dimensionally Extended 9-Intersection Model (DE-9IM) (http://docs.geotools.org/latest/userguide/library/jts/dim9.html (accessed on 1 January 2024)). The spatial relationships between geographic entities describe the spatial semantic connections between them. This ontology design enables HGeoKG to represent geographic knowledge with greater accuracy and comprehensiveness.
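To make the DE-9IM basis of the relationship types concrete, the following is a minimal sketch of how a 9-character DE-9IM matrix string can be matched against the standard named patterns. The function names and the choice of which predicates to list are illustrative assumptions; the exact predicate set used in Table 3 is not reproduced here.

```python
# Illustrative sketch: classify a DE-9IM matrix string with standard patterns.
# The helper names below are hypothetical and not part of HGeoKG.

# Standard DE-9IM patterns for a few named predicates.
PATTERNS = {
    "within": ["T*F**F***"],
    "contains": ["T*****FF*"],
    "touches": ["FT*******", "F**T*****", "F***T****"],
    "disjoint": ["FF*FF****"],
}


def matches(matrix: str, pattern: str) -> bool:
    """Return True if a DE-9IM matrix satisfies a pattern.

    Matrix cells are 'F', '0', '1', or '2'. In a pattern, '*' accepts any
    cell, 'T' accepts any non-empty intersection ('0', '1', '2'), and any
    other character must match exactly.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("DE-9IM matrices and patterns have 9 cells")
    for cell, expected in zip(matrix, pattern):
        if expected == "*":
            continue
        if expected == "T":
            if cell not in "012":
                return False
        elif cell != expected:
            return False
    return True


def named_relations(matrix: str) -> list:
    """List every predicate in PATTERNS that the matrix satisfies."""
    return sorted(
        name
        for name, patterns in PATTERNS.items()
        if any(matches(matrix, p) for p in patterns)
    )


# A point strictly inside a polygon, and two polygons sharing an edge.
print(named_relations("0FFFFF212"))  # ['within']
print(named_relations("FF2F11212"))  # ['touches']
```

In practice a GIS library would compute the matrix from the geometries; the sketch only shows how a matrix maps to named relationship types.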
3.2.2. Spatial Relationship Hierarchical Structure
Considering the spatial relationships based on entity types enables the more effective modeling of potential human, commercial, and economic semantic connections between geographic entities, which are not easily revealed by distance-based spatial relationships alone. In this study, we explicitly model these latent semantics, with spatial relationships serving as the bridge that carries these hidden meanings. A straightforward example illustrates this: typically, stationery stores are located near primary and secondary schools. The spatial relationship of “school-adjacent-stationery store” not only uncovers the commercial connection between schools and stationery stores but also, through the explicit modeling of this relationship, allows for the more effective use of the semantic information inherent in the geographic entities themselves. These type-based spatial relationships can be viewed as prior semantic rules extracted from the data, revealing semantic content that pure distance-based spatial relationships cannot express. This modeling approach plays a crucial role in achieving a comprehensive semantic representation of geographic knowledge.
The general spatial relationship types, as shown in Table 3, simply reflect the spatial semantics between geographic entities. In this study, the spatial types of geographic entities (point, line, polygon) and their common sense types, as shown in Table 2, are integrated into the representation of spatial relationships. This has led to an extension of these relationships at different levels of granularity, and the construction of a hierarchical structure, as shown in Figure 3. As shown in Figure 3a, the coarse-grained spatial relationships integrate the spatial type semantics of two geographic entities. Correspondingly, as illustrated in Figure 3b, the fine-grained spatial relationships incorporate the common sense type semantics of the same entities. This explicit modeling approach of hierarchical spatial relationships, which combines entity types of different granularities, enables a more specific and accurate expression of spatial semantics in geographic knowledge.
We incorporated geographic entity type information into the spatial relationship modeling. Based on the richness of the entity type information, spatial relationships were categorized into different granular hierarchical structures. In the subsequent experiments detailed in Section 4.3.5, we evaluated the impact of spatial relationships at various hierarchical levels on geographic knowledge embedding learning, further validating the expressive capability of the knowledge graph.
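The three granularity levels can be sketched as follows. This is a minimal illustration that assumes relationship labels are formed by joining the entity types and the general relationship with hyphens, following the “school-adjacent-stationery store” example; the `GeoEntity` class, the field names, and the coarse-grained label format are hypothetical.

```python
# Illustrative sketch: derive general, coarse-grained, and fine-grained
# relationship labels for one pair of entities. Names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoEntity:
    osm_id: str
    spatial_type: str  # "point", "line", or "polygon"
    category: str      # common sense type, e.g. "school"


def relation_labels(head: GeoEntity, relation: str, tail: GeoEntity) -> dict:
    """Return the relationship label at each level of the hierarchy."""
    return {
        # General level: the plain spatial relationship.
        "general": relation,
        # Coarse-grained level: adds the spatial types of both entities.
        "coarse": f"{head.spatial_type}-{relation}-{tail.spatial_type}",
        # Fine-grained level: adds the common sense types of both entities.
        "fine": f"{head.category}-{relation}-{tail.category}",
    }


school = GeoEntity("w1001", "polygon", "school")
shop = GeoEntity("n2002", "point", "stationery store")
labels = relation_labels(school, "adjacent", shop)
print(labels["coarse"])  # polygon-adjacent-point
print(labels["fine"])    # school-adjacent-stationery store
```

Keeping all three labels for each entity pair makes it possible to train or evaluate on one granularity at a time without recomputing the underlying spatial relationships.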
3.2.3. Regional Hierarchical Structure
Spatial heterogeneity reflects the variation and diversity of geographic phenomena, exhibiting inherently uncontrollable spatial patterns. To promote the study of geographic knowledge heterogeneity across regions, this paper proposes a hierarchical regional structure based on real-world administrative division data, as illustrated in Figure 4. HGeoKG first partitions OSM data into larger regions using coarse-grained administrative division data and subsequently further subdivides these large regions with finer-grained administrative division data, thereby forming a more detailed regional hierarchy. This hierarchical regional structure not only reveals spatial heterogeneity within each level and the interactions between coarse- and fine-grained regions but also enables HGeoKG to accurately analyze geographic entities and their relationships within each region. This facilitates more refined modeling, thereby comprehensively enhancing the accuracy and granularity of geographic knowledge representation.
In existing GeoKGs, geographic entities are typically assigned discrete spatial locations, usually represented by a single latitude and longitude coordinate. However, the regional distribution of geographic entities and the spatial heterogeneity between regions are crucial for their semantic representation across different areas. Current GeoKGs have not sufficiently accounted for the influence of these factors on the comprehensive semantic representation of geographic entities. The set of geographic entities within a specific region reflects the region’s human, economic, ecological, and transportation conditions, and the distributional differences of these factors between regions are of significant importance for cross-regional studies. Therefore, incorporating regional distribution constraints and prior knowledge into GeoKGs, as well as constructing benchmark datasets for spatial heterogeneity research, is essential for exploring the homogeneous and heterogeneous patterns and rules across different regions.
HGeoKG constructs a regional hierarchical structure based on real administrative divisions. This hierarchical structure effectively captures spatial heterogeneity through multi-level regional representations. For example, fine-grained regions at various levels are used to characterize the local features of geographic spaces, while coarse-grained regions reflect global characteristics. This approach provides a more comprehensive depiction of spatial heterogeneity and distribution differences.
3.2.4. Meta-Analysis and Geographic Entity Examples
For each entity, we use the unique id of the original OSM element as its name and use the tags from the OSM element as the entity’s attributes. Geographic entities are connected through spatial relationships. Figure 5a provides an example of a resource description framework (RDF) triple file in Turtle format for a GeoKG. This example includes information about the geometric spatial type of the entity, its common sense type, and various heterogeneous attributes such as name tags, business hours, and more. In addition to the attributes of the geographic entities themselves, the example also includes spatial semantic relationships between entities, such as adjacency and intersection. Figure 5b also shows a visualization of the GeoKG, intuitively displaying the attributes of geographic entities and the spatial relationships between them.
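As a rough illustration of what such a Turtle description could look like, the sketch below serializes one entity with its spatial type, common sense type, heterogeneous attributes, and spatial relationships. The namespace IRIs, the prefixes `ent` and `ont`, the predicate names, and the sample values are all hypothetical placeholders and are not taken from the HGeoKG files.

```python
# Illustrative sketch: serialize one geographic entity as Turtle text.
# Prefixes, IRIs, and predicate names are hypothetical placeholders.

PREFIXES = {
    "ent": "http://example.org/hgeokg/entity/",
    "ont": "http://example.org/hgeokg/ontology/",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def prefix_block() -> str:
    """Render the @prefix declarations."""
    return "\n".join(f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items())


def entity_to_turtle(osm_id, spatial_type, category, attributes, relations) -> str:
    """Render one entity as a Turtle subject with a predicate list.

    attributes: dict mapping attribute name to a string value.
    relations: list of (relationship, target OSM id) pairs.
    Type and relationship names are assumed to be valid local names
    (for example, without spaces).
    """
    predicates = [
        f"ont:spatialType ont:{spatial_type}",
        f"ont:category ont:{category}",
    ]
    for name, value in attributes.items():
        predicates.append(f'ont:{name} "{escape_literal(value)}"')
    for relation, target in relations:
        predicates.append(f"ont:{relation} ent:{target}")
    body = " ;\n    ".join(predicates)
    return f"ent:{osm_id}\n    {body} ."


print(prefix_block())
print(entity_to_turtle(
    "123456",
    "Point",
    "cafe",
    {"name": "Corner Cafe", "openingHours": "Mo-Fr 08:00-18:00"},
    [("adjacent", "654321")],
))
```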
3.3. Data Layer
This subsection presents the data layer of HGeoKG, which extracts, processes, and integrates geographic entities, attributes, and spatial relationships from OSM data and constructs the regional hierarchical structure from reference spatial region data. Figure 6 illustrates the complete process of building the hierarchical GeoKG from OSM data. First, Section 3.3.1 describes the extraction of attribute information for geographic entities. Then, in Section 3.3.2, GIS tools are used to compute the regional hierarchical divisions of geographic entities and their spatial relationships within the regions. Finally, Section 3.3.3 details the integration of geographic entities’ attribute information and spatial relationships to construct the complete GeoKG, alongside data storage and visualization examples.
3.3.1. Entity and Attribute Extraction
This subsection focuses on extracting attribute information for geographic entities from OSM data. First, specific geographic regions are identified, and the OSM data for these regions are downloaded in both Protocolbuffer Binary Format (PBF) and shapefile (SHP) format. PBF files contain the complete attribute information for each geographic entity, while SHP files primarily provide the spatial types and spatial information of geographic entities. Osmium is an efficient library designed for processing OSM data and is particularly well suited to large-scale datasets. Using the Osmium library, attribute information for geographic entities is extracted from the PBF files. This extraction results in triples of the form (geographic entity, attribute, attribute value), with attribute names converted to camel case. Inferring whether an entity is a line or a polygon from the latitude and longitude information in raw PBF data is difficult. Consequently, GIS tools are employed to extract spatial-type information from the SHP files, resulting in spatial-type triples of the form (geographic entity, spatial type, point/line/polygon). Each extracted geographic entity must have at least one attribute; entities that have only a latitude and longitude and no other attributes are filtered out. The triples extracted in this subsection represent only the relationships between geographic entities and attribute entities, without establishing direct connections among geographic entities.
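The conversion from OSM tags to attribute triples can be sketched as below. This is a simplified illustration that assumes the tags of each element have already been read from the PBF file into plain dictionaries; it does not call the Osmium library, and the function names and the rule for splitting tag keys (on any non-alphanumeric character) are assumptions.

```python
# Illustrative sketch: turn OSM tags into (entity, attribute, value) triples.
# Input is assumed to be pre-parsed (osm_id, tags) pairs, not a PBF file.
import re


def to_camel_case(key: str) -> str:
    """Convert an OSM tag key such as 'opening_hours' to 'openingHours'."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", key) if p]
    if not parts:
        return ""
    first, rest = parts[0], parts[1:]
    return first.lower() + "".join(p[:1].upper() + p[1:] for p in rest)


def attribute_triples(elements):
    """Build attribute triples and drop entities that have no tags.

    elements: iterable of (osm_id, tags) pairs, where tags is a dict.
    """
    triples = []
    for osm_id, tags in elements:
        if not tags:
            # Entity has coordinates only; it is filtered out.
            continue
        for key, value in tags.items():
            name = to_camel_case(key)
            if name:
                triples.append((osm_id, name, value))
    return triples


sample = [
    ("n1", {"name": "Corner Cafe", "opening_hours": "Mo-Fr 08:00-18:00"}),
    ("n2", {}),  # no attributes, so no triples are produced
]
for triple in attribute_triples(sample):
    print(triple)
```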
3.3.2. Spatial Relationship Extraction
After the extraction of entities and attributes, the KG still lacks spatial relationships between geographic entities. This subsection introduces the data-processing methods for extracting spatial relationships between geographic entities. Using the neighborhood analysis and spatial join functions of GIS tools, spatial relationships among geographic entities within the same region are extracted, resulting in triples of the form (geographic entity, spatial relationship, geographic entity). Table 3 presents the spatial relationships between point, line, and polygon geographic entities, including relationships such as containment, adjacency, and intersection. Based on the spatial relationships designed in Section 3.2.2, this study explicitly enhances the common sense semantic information of spatial relationships by incorporating geographic entities at different hierarchical levels. In Section 4.3.5, we discuss, through experiments, the impact of explicitly integrating entity type information of varying granularity into spatial relationships.
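As a simplified stand-in for the GIS neighborhood analysis, the sketch below derives adjacency triples for point entities within one region by comparing pairwise distances against a threshold. The planar coordinates, the `radius` parameter, the brute-force pairwise comparison, and the relationship name "adjacent" are assumptions made for illustration; the actual pipeline uses GIS tools and also handles line and polygon entities.

```python
# Illustrative sketch: distance-based adjacency triples for point entities
# in a single region. A simplified stand-in for GIS neighborhood analysis.
from itertools import combinations
from math import hypot


def adjacency_triples(points: dict, radius: float) -> list:
    """Return symmetric (entity, 'adjacent', entity) triples.

    points: dict mapping entity id to planar (x, y) coordinates.
    radius: maximum distance at which two entities count as adjacent.
    """
    triples = []
    # Sorting the ids gives a deterministic output order.
    for (id_a, (xa, ya)), (id_b, (xb, yb)) in combinations(sorted(points.items()), 2):
        if hypot(xa - xb, ya - yb) <= radius:
            triples.append((id_a, "adjacent", id_b))
            triples.append((id_b, "adjacent", id_a))
    return triples


region_points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (10.0, 10.0)}
print(adjacency_triples(region_points, radius=6.0))
# [('a', 'adjacent', 'b'), ('b', 'adjacent', 'a')]
```

The pairwise loop is quadratic in the number of entities, which is one reason to compute relationships region by region rather than over the whole dataset at once.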
3.3.3. Regional Hierarchical Division and Extraction
As illustrated in Figure 6, this study employs the Clip operation within GIS tools to partition geographic entities from OSM data into administrative regions of varying hierarchical levels, thereby obtaining collections of geographic entities within coarse-grained or fine-grained regions at each level. Specifically, we first utilize coarse-grained administrative region data to perform an initial division of the OSM data, generating several coarse-grained regions. Subsequently, based on fine-grained administrative division data, we further subdivide the OSM data within these coarse-grained regions to form fine-grained regions. As shown in Figure 4, the regional hierarchy progresses from coarse to fine, with each coarse-grained region encompassing multiple fine-grained regions, and as the granularity increases, the number of fine-grained regions progressively increases. This multi-level regional partitioning method effectively reflects differences in data distribution and other aspects across regions.
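The coarse-to-fine partitioning can be sketched for point entities as follows. This is a minimal illustration that replaces the GIS Clip operation with a ray-casting point-in-polygon test; the region names, the function names, and the simple polygon representation (a list of vertices without holes) are hypothetical.

```python
# Illustrative sketch: two-level regional partitioning of point entities.
# A ray-casting test stands in for the GIS Clip operation.


def point_in_polygon(x: float, y: float, polygon: list) -> bool:
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_regions(points: dict, regions: dict) -> dict:
    """Map each region id to the ids of the points it contains."""
    result = {region_id: [] for region_id in regions}
    for entity_id, (x, y) in points.items():
        for region_id, polygon in regions.items():
            if point_in_polygon(x, y, polygon):
                result[region_id].append(entity_id)
    return result


def hierarchical_partition(points: dict, coarse: dict, fine_by_coarse: dict) -> dict:
    """Partition by coarse regions first, then by the fine regions inside each."""
    hierarchy = {}
    for coarse_id, members in assign_regions(points, coarse).items():
        subset = {entity_id: points[entity_id] for entity_id in members}
        hierarchy[coarse_id] = assign_regions(subset, fine_by_coarse.get(coarse_id, {}))
    return hierarchy


coarse = {"countyA": [(0, 0), (10, 0), (10, 10), (0, 10)]}
fine = {"countyA": {
    "tract1": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "tract2": [(5, 0), (10, 0), (10, 10), (5, 10)],
}}
points = {"p1": (2, 2), "p2": (7, 3), "p3": (20, 20)}
print(hierarchical_partition(points, coarse, fine))
# {'countyA': {'tract1': ['p1'], 'tract2': ['p2']}}
```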
Through this approach, HGeoKG is capable of effectively managing geographic entities within administrative regions of varying hierarchical levels and facilitates the computation of spatial relationships between entities within each region. Compared to other GeoKG methods that typically employ a unified global hierarchical structure or simple planar models, HGeoKG excels in capturing the spatial structures and regional differences at each hierarchical level. GeoKG methods that lack an effective modeling of regional hierarchical structures fail to adequately differentiate the details of various hierarchical regions and do not fully consider regional differences, which can lead to insufficient spatial semantic representation. By implementing multi-level regional divisions, HGeoKG not only meticulously reflects the characteristics of each hierarchical region but also comprehensively enhances the accuracy and granularity of geographic knowledge representation.
Furthermore, the dataset files of HGeoKG are stored separately for each hierarchical level of the regions, with each region’s data organized into individual files. This file structure allows the knowledge of each region to be used and studied independently, further enhancing the flexibility and operability of HGeoKG in geographic knowledge processing.
Moreover, for geographic entities that span multiple regions (such as streets crossing multiple Census Tracts), our approach is to retain information about the entity in each relevant regional dataset. This method ensures that each regional dataset fully reflects the entities it contains. For cross-regional spatial relationships (such as linear spatial relationships connecting different regions), we choose to store them separately rather than directly integrating them into the regional datasets. These cross-regional relationships have been organized into independent data files and are included with the project files.
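One way to realize this storage rule is sketched below. It assumes that each entity carries the set of regions it belongs to (an entity spanning several regions appears in several sets) and treats a relationship as cross-regional when its two entities share no region. This criterion, the function name, and the sample identifiers are assumptions for illustration and may differ from the rule used in the released files.

```python
# Illustrative sketch: separate per-region triples from cross-regional ones.
# The "no shared region" criterion is an assumption made for this example.


def split_relations(triples: list, entity_regions: dict):
    """Split relationship triples by region membership.

    triples: list of (head, relationship, tail).
    entity_regions: dict mapping entity id to the set of region ids it lies in.
    Returns (per_region, cross_region), where per_region maps a region id to
    its triples and cross_region lists triples whose entities share no region.
    """
    per_region = {}
    cross_region = []
    for head, relation, tail in triples:
        shared = entity_regions.get(head, set()) & entity_regions.get(tail, set())
        if shared:
            # Keep the triple in every region that contains both entities.
            for region_id in sorted(shared):
                per_region.setdefault(region_id, []).append((head, relation, tail))
        else:
            cross_region.append((head, relation, tail))
    return per_region, cross_region


entity_regions = {
    "street": {"tract1", "tract2"},  # spans two regions
    "cafe": {"tract1"},
    "park": {"tract2"},
}
triples = [("street", "adjacent", "cafe"), ("cafe", "adjacent", "park")]
per_region, cross_region = split_relations(triples, entity_regions)
print(per_region)    # {'tract1': [('street', 'adjacent', 'cafe')]}
print(cross_region)  # [('cafe', 'adjacent', 'park')]
```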
3.4. Generalizability and Scalability of HGeoKG
This subsection discusses the generalizability and scalability of HGeoKG. Firstly, regarding generalizability, HGeoKG is constructed using multi-source heterogeneous geographic data, including OSM point, line, and polygon data, as well as administrative boundary data. This enables it to cater to the geographic knowledge representation needs of different regions. Our hierarchical structure design, which includes spatial relationship hierarchy and regional hierarchy, is not only applicable to the data of the current experimental area, but also provides a transferable modeling framework for other geographic regions. Additionally, the construction process and methods of HGeoKG can be applied to various types of geographic datasets, demonstrating good generalizability.
Secondly, in terms of scalability, HGeoKG adopts a modular design, separating the Schema layer from the Data layer. This design ensures the convenient incorporation of new data and new relationships. Specifically, the Data layer can be dynamically expanded based on different regions or larger-scale data sources, while the ontology structure and hierarchical design of the Schema layer can be reused, supporting efficient knowledge updates and expansions.
5. Conclusions
This paper proposes a hierarchical geographic knowledge graph, HGeoKG, which provides a comprehensive semantic representation of geographic knowledge, encompassing rich attributes and spatial relationships, while featuring both regional and spatial relationship hierarchies. It offers theoretical and methodological references for constructing GeoKGs. Extensive geographic knowledge reasoning experiments on HGeoKG demonstrate that the performance of most knowledge graph embedding (KGE) models is significantly affected by the marked regional heterogeneity and long-tail distribution patterns in the HGeoKG dataset, resulting in unsatisfactory embedding quality. This highlights the importance of considering different modeling strategies for different regions and of improving the embedding quality of long-tail geographic entities when designing or deploying HGeoKG embedding algorithms in practice. Current evaluation metrics do not adequately capture the effects of spatial heterogeneity, and designing suitable metrics specifically for geographic datasets remains a crucial direction for future research.
We believe that HGeoKG can serve as a valuable new benchmark for studying the characteristics of geographic knowledge and evaluating geographic knowledge representation learning. However, the open-source geographic information used in this study (e.g., OSM) may suffer from issues such as incompleteness and inconsistency. These data deficiencies could have a significant impact on the results, particularly in regions with imbalanced entity types or incomplete annotations. Additionally, the use of administrative boundaries as the basis for geographic unit division in this study could introduce certain biases. The administrative divisions were not specifically designed for this study, and their spatial distribution may not be fully compatible with the model’s requirements. Finally, as this study focuses on hierarchical geographic knowledge graph embedding and inference, the model’s ability to handle extreme conditions in specific tasks (e.g., sparse or heterogeneous data distributions) is still limited. Future work will further explore methods to improve model performance under these conditions.