1. Introduction
In recent years, the frequency and magnitude of landslide disasters have been on the rise, attributable to factors including climate change, human activities, and geological processes. These developments pose a substantial threat to human lives and property safety [1,2,3]. Consequently, the timely and effective prediction of landslide occurrence times, locations, and scales is of paramount significance in mitigating the losses stemming from landslide disasters. Traditional landslide prediction methods encompass physics-based methods and empirical methods. However, these methods have various issues, including high demands for data accuracy and model builder expertise [4,5,6], expensive modeling and computational costs [7,8], and a significant reliance on expert experience [9,10,11,12,13]. These issues collectively result in lower efficiency in landslide prediction.
With the development of remote sensing and artificial intelligence technologies, data-driven methods have gradually become the mainstream for predicting landslides. Data-driven methods typically select landslide conditioning factors (LCFs) that affect landslide occurrence as input variables. They use known landslide occurrences or non-occurrences as labels to train models capable of landslide prediction. These data-driven methods can adaptively adjust model parameters using substantial historical data, resulting in improved model generalization and robustness. In contrast to empirical methods, data-driven methods can efficiently handle data with multiple attributes, such as geology, landform, and climate. These datasets are often high-dimensional, but data-driven methods can process them more swiftly, thereby enhancing the landslide prediction efficiency. Commonly used data-driven methods today include support vector machines (SVMs), artificial neural networks (ANNs), and random forest (RF). These methods have been successfully applied in various landslide cases with promising results [14,15,16]. Nevertheless, data-driven landslide prediction methods also have their limitations.
An important challenge lies in the oversimplification of landslide scenarios by data-driven methods. The occurrence of landslides is a multifaceted process shaped by the interplay of various geographical elements. For instance, geological formations, soil compositions, and vegetation cover within distinct geographical settings may collectively affect landslide occurrence. These effects can manifest in explicit or implicit ways. Explicit relations are those that can be distinctly articulated, such as the greater likelihood of landslides in areas with steep slopes or the influence of topography on the flow and distribution of precipitation. Implicit relations, conversely, pertain to connections that defy precise definition. Take, for example, the impact of certain human activities on landslide risk. While both human activities and landslides can be documented, elucidating the precise mechanisms through which human activities influence landslides remains challenging.
We describe these interactions in the geographic environment, whether explicit or implicit, as ‘semantic information in the geographic environment’. Explicit relations are challenging for data-driven methods to express because, when predicting landslides, data-driven methods typically use grid cells as prediction units [4,17]. They rely on generating a landslide dataset with a ‘sequence’ structure for each grid cell to train the model and predict landslides, treating each piece of data as an independent sample. Modeling landslide scenarios based on this ‘sequence’ structure makes it difficult to capture the relations between different grid cells. Data-driven methods also have difficulty capturing the implicit relations in semantic information, for the same reason: the ‘sequence’ structure cannot learn relations between sequences during training. Data-driven methods therefore lose semantic information during modeling and training, which reduces the accuracy of landslide prediction. In short, this simplified modeling approach cannot adequately represent the complexity of landslide scenarios, which in turn affects the accuracy of landslide prediction.
To address this issue, we propose the modeling of landslide scenarios based on a knowledge graph [18,19,20,21]. The “graph” structure within the knowledge graph can more directly represent the explicit relations between LCFs. It is also more conducive to uncovering the implicit relations between grid cells and LCFs, allowing us to discover spatial patterns in the landslide process.
Figure 1 illustrates the contrast between the data-driven method and the knowledge graph method for modeling landslide scenarios. It is evident from the figure that modeling landslide scenarios using the “graph” structure outperforms the “sequence” structure, particularly in capturing semantic information.
Furthermore, we advocate performing landslide prediction based on knowledge graph embedding (KGE) [1,2,19]. KGE assigns semantic interpretations to the vectors of entities and relations within the knowledge graph by learning semantic associations in the vector space. This means that similar entities and relations are also similar in the vector space. In some applications, KGE has demonstrated the ability to capture complex semantic relationships [22,23] and to improve performance under conditions of data scarcity [24,25]. For landslide prediction, compared to data-driven methods, KGE can automatically learn the influence of landslide conditioning factors (LCFs) in the vector space based on entities and relations. KGE effectively maps multi-source data into a vector space, enhancing its capability to capture the implicit relations between LCFs. Additionally, in cases of sparse datasets, KGE can infer relations and patterns within the limited data, thus filling in gaps and improving the accuracy of landslide prediction methods. Consequently, it contributes to the enhancement of the precision and generalization ability of landslide prediction methods.
In this study, we present a comprehensive approach to construct a knowledge graph tailored for landslide prediction, effectively transforming this task into a graph-based link prediction problem using KGE techniques. The paper is structured as follows: Section 2 introduces the study area and data sources. Section 3 details the entire process of landslide prediction using knowledge graph techniques, encompassing data preprocessing, knowledge graph creation, and the application of KGE for prediction. Section 4 offers a comparative analysis with a generic landslide prediction model and showcases our prediction results. Section 5 delves into the strengths and limitations of employing KGE in landslide prediction. Finally, Section 6 concludes the paper, summarizing our findings and contributions.
2. Study Area and Data
Xiji County is situated in the southern part of Guyuan City, in the Ningxia Hui Autonomous Region, China. It is positioned between longitude 105°20′–106°04′E and latitude 35°35′–36°14′N and is geographically proximate to the western foothills of the Liupan Mountains. The county lies on the Loess Plateau and is characterized by an arid hilly landscape. The terrain encompasses Hulu River plains, loess hills and gullies, and soil and rocky mountains. The elevation gradually increases from south to north, spanning from 1688 to 2633 meters. The susceptibility to loess landslides in Xiji County is attributed to its rugged terrain and narrow ridges. The combination of these geographical factors creates conditions conducive to landslide occurrences, making the region a suitable area for the validation of landslide prediction methods. To support our study, we gathered 741 landslide records from the Ningxia Remote Sensing Survey and Mapping Institute, forming the basis for constructing the knowledge graph. The study area’s details are depicted in Figure 2.
The environmental data obtained from the study area consist of seven categories: geology, landform, soil, climate, vegetation, transportation systems, and population. These data sources are multi-sourced. We supplemented the non-public data, provided by the Ningxia Remote Sensing Survey and Mapping Institute, with terrain, precipitation, and road data from various public datasets. Each data category comprises multiple LCFs, and all LCFs from the environmental data are recorded in the schema layer of the knowledge graph.
Table 1 displays the LCFs within each environmental data category, and Table 2 provides detailed information about each category of environmental data. It is worth noting that the time span of these data coincides with the time range of the landslide records.
3. Methodology
The knowledge graph-based landslide prediction method comprises three stages, as shown in Figure 3. Initially, data from various sources are gathered and subjected to preprocessing to create a data collection consisting of a tile list, landslide inventory, and environmental data, all in a standardized format. The tile list is a record of coordinates covering the study area, where each tile functions as a grid cell for landslide prediction. The landslide inventory includes details of past landslide events in the study area, encompassing factors like location, magnitude, landslide type, and the resulting impact. This inventory forms the foundation for landslide susceptibility and risk assessment [4,31,32]. Next, the data in this uniform format are transformed into a collection of triples, an example of which is illustrated in Figure 4. A triple is the fundamental unit of the knowledge graph, denoted by (h, r, t), where h represents the head entity, t represents the tail entity, and r signifies the relation between them. These triples are then used to construct the knowledge graph according to the designed schema. Finally, the knowledge graph undergoes embedded representation learning, while the landslide prediction task is redefined as a graph-based prediction task, enabling the assessment of susceptibility within the study area through link prediction.
3.1. Preprocessing
The purpose of preprocessing is to standardize the format of heterogeneous data from various sources and to create a collection of tile coordinates for the study area. Using tiles as the mapping unit for landslide prediction offers the advantage of adaptability to various geographic scales and efficient processing. During the preprocessing stage, a tile collection is created based on the specified tile level. Specifically, the study area’s location is represented in tile coordinates following the Web Mercator rule [33]. The conversion rules between real latitude and longitude coordinates and tile coordinates are as follows:

$$x = \left\lfloor \frac{\lambda + 180^{\circ}}{360^{\circ}} \cdot 2^{z} \right\rfloor, \qquad y = \left\lfloor \left(1 - \frac{\ln\left(\tan\varphi + \sec\varphi\right)}{\pi}\right) \cdot 2^{z-1} \right\rfloor$$

where $\lambda$ and $\varphi$ denote the input longitude and latitude coordinates, respectively. The transformed horizontal and vertical coordinates of the tile are denoted by x and y, while z denotes the zoom level of the tile. Each tile corresponds to a specific set of longitude and latitude coordinates, and the quantity of coordinates varies depending on the zoom level. Notably, higher zoom levels lead to tiles that each cover fewer coordinates, enhancing spatial accuracy at the expense of increased computational complexity. In this paper, we select zoom level 18, as illustrated in Figure 5, to strike a balance between spatial accuracy and computational complexity. At level 18, the study area comprises a total of 205,330 tiles.
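To make the tile mapping concrete, the short sketch below implements the standard Web Mercator (slippy-map) conversion in Python; it follows the public tiling convention rather than the authors' exact preprocessing code, and the sample coordinates are only illustrative.

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Convert WGS84 longitude/latitude (degrees) to Web Mercator tile coordinates (x, y)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# A point roughly inside the study area (approx. 105.7 E, 35.9 N) at zoom level 18
print(lonlat_to_tile(105.7, 35.9, 18))
```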
Then, the format of the heterogeneous data from multiple sources is harmonized, and an indexed list of data attribute values and tiles is generated. The data structure primarily comprises vector points (e.g., landslide records), vector lines (e.g., hydrological data), vector surfaces (e.g., geological data), raster data at various scales (e.g., terrain data), and CSV files (e.g., population distribution data). Initially, the geographic coordinate systems of these data are standardized. Subsequently, we extract the corresponding attribute values from each tile. For discrete attribute values (e.g., fault types with a limited number of values like normal fault, reverse fault, strike-slip fault, and hidden fault), it is relatively straightforward to create an index list for each tile based on these attribute values. However, for continuous attribute values (e.g., elevation, which is continuous), we first discretize the attribute values by assessing the data and selecting the appropriate scale, and then generate an indexed list for each tile using these discretized values of the attribute.
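As an illustration of this step, the following hypothetical sketch discretizes a continuous attribute (elevation) into range labels and builds a per-tile index; the tile keys, factor names, and bin count are assumptions chosen for demonstration only.

```python
import numpy as np

def discretize(values: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Map continuous attribute values (e.g., elevation) to discrete range labels 0..n_bins-1."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])

# Illustrative tiles with one continuous factor (elevation) and one discrete factor (fault type)
tiles = [(207900, 103017), (207901, 103017), (207902, 103018)]
elevation = np.array([1702.0, 1955.0, 2410.0])
fault_type = ["normal_fault", "hidden_fault", "strike_slip_fault"]

altitude_range = discretize(elevation, n_bins=5)

# Indexed list: every tile maps to its (discretized) LCF values
tile_index = {
    t: {"altitude_range": int(a), "fault_type": f}
    for t, a, f in zip(tiles, altitude_range, fault_type)
}
print(tile_index)
```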
3.2. Knowledge Graph Construction
3.2.1. Schema Layer
The schema layer defines the structure and specifications of the concepts within the knowledge graph. It serves as a metadata model for describing the relations between entities in the knowledge graph, including their attributes. The schema layer establishes a unified semantic framework for the data in the knowledge graph. This framework enhances organization, query capabilities, and interpretability, and aids in reasoning computations. In this paper, the schema layer for describing disaster scenarios is composed of a basic vocabulary and an ontology that defines concepts related to disasters, as shown in Figure 6.
We utilize the resource description framework (RDF) [34] and resource description framework schema (RDFS) [35] vocabularies to establish the foundational terminology within the knowledge graph. RDF is a standard designed for representing relations among resources in the semantic web. RDFS, an extension of RDF, is responsible for defining more intricate hierarchies between resources. RDF/RDFS encapsulates knowledge within triples, each composed of subject, predicate, and object. In RDF/RDFS triples, the subject corresponds to the head entity within the knowledge graph triple, the object corresponds to the tail entity, and the predicate signifies the relation, as illustrated in Figure 7.
The RDF/RDFS vocabularies employed in this paper primarily encompass: rdf:type, rdf:subject, rdf:predicate, rdf:object, rdfs:Class, and rdfs:subClassOf. To enhance the clarity of our proposed approach, we omitted the resource prefixes and retained only the relation prefixes in this paper. The utilization of RDF/RDFS vocabularies offers a fundamental and standardized method for data description. Additionally, the schema layer constructed using RDF/RDFS enables more effective comprehension and processing of heterogeneous disaster data from various sources.
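A minimal sketch of what such schema-layer statements look like in practice is given below, using the rdflib library; the NADE namespace URI and class names are placeholders for illustration, not the published ontology.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

NADE = Namespace("http://example.org/nade#")  # placeholder URI for the NADE ontology

g = Graph()
g.bind("nade", NADE)

# Declare core classes and a simple hierarchy with the RDFS vocabulary
g.add((NADE.Disaster, RDF.type, RDFS.Class))
g.add((NADE.Landslide, RDF.type, RDFS.Class))
g.add((NADE.Landslide, RDFS.subClassOf, NADE.Disaster))
g.add((NADE.Landslide, RDFS.label, Literal("landslide")))

print(g.serialize(format="turtle"))
```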
Furthermore, we utilize the GeoSPARQL glossary [36,37] to define the fundamental spatial information vocabulary for disaster scenarios. The GeoSPARQL glossary encompasses a comprehensive set of terms designed for representing geospatial information. These terms facilitate the description of various spatial aspects, including geographical coordinates, types of geographic elements (e.g., points, lines, and polygons), and geospatial relations (e.g., intersects, overlaps, and disjoint) within disaster records, disaster-related tasks, and LCFs in the study area. GeoSPARQL offers a standardized approach for integrating and sharing geospatial data. Utilizing the GeoSPARQL terminology ensures a consistent interpretation and utilization of disaster data from diverse heterogeneous sources.
Ontology [38] is a formal knowledge representation for describing concepts, entities, attributes, and relations within a domain. The goal of ontology is to capture consensus and semantics within a domain, enabling different systems to share and comprehend domain-specific knowledge. Leveraging the expressive advantages of ontology, we designed and implemented the natural disaster emergency ontology (NADE ontology). NADE is capable of representing the semantics among landslide data while also supporting extensions to other disaster domains. Given the characteristics of disaster-related data, we subdivided the NADE ontology into three parts: NADE-Core, which delineates the fundamental concepts of disasters; NADE-Environment, which describes the disaster environment; and NADE-Task, which outlines the disaster tasks. For each term in the NADE ontology, we referenced relevant disaster emergency standards and existing disaster ontology definitions when formulating their definitions. Additionally, we considered the practical aspects of disaster emergency task handling.
NADE-Core establishes the essential vocabulary for describing disasters and forms the foundation of the NADE ontology. For instance, at any stage of a disaster, fundamental attributes such as the disaster type, its current phase, the affected objects, and the resulting impacts must be described. NADE-Environment defines the concept of the environment within a disaster scenario. Environmental changes are often the root causes of disasters, and social environmental factors directly influence the extent of damage caused. Providing a unified description of the environment in a disaster scenario helps identify patterns in disaster occurrences within a specific region. NADE-Task specifies the indicators of a disaster task, encompassing elements such as risk, hazard, impact, severity, likelihood, susceptibility, exposure, vulnerability, and their relations.
Figure 8 illustrates the core vocabulary and primary relations defined by the NADE ontology.
3.2.2. Data Layer
The data layer converts the data generated in the preprocessing stage into triples. These triples include tile triples, record triples, and factor triples. Tile triples describe the positional relation between tiles, for example: ((831,878, 410,956), nade:hasAdjacentTile, (831,878, 410,957)). Record triples are generated by combining tile entities and landslide record entities. Each landslide record entity represents information about landslide events in the study area, for example: (disaster_record_6b65e22d, nade:hasDisasterType, landslide). With the integration of the GeoSPARQL ontology from the schema layer, record triples can provide detailed spatial information about landslide events, for example: (disaster_record_6b65e22d_geom, rdf:type, point). Factor triples are created by combining tile entities and environment entities, offering descriptions of the environmental properties of each tile in the study area. A discrete factor value directly yields a triple such as ((831,878, 410,956), nade:hasFactorPropertyType, hidden_fault), whereas a continuous factor value is first scaled into discrete categories and then yields a triple such as ((831,878, 410,956), nade:hasFactorPropertyType, altitude_range_3). When combined with the NADE ontology in the schema layer, the tile triples, record triples, and factor triples illustrate the relations among multiple sources of heterogeneous data within the disaster scenario, which is essential for effective landslide hazard modeling in the study area.
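The sketch below mirrors these examples with rdflib, generating one tile triple, record triples with a GeoSPARQL point geometry, and factor triples; the namespace URIs and identifiers are illustrative assumptions rather than the study's actual resources.

```python
from rdflib import Graph, Namespace, Literal, RDF

NADE = Namespace("http://example.org/nade#")    # placeholder NADE namespace
TILE = Namespace("http://example.org/tile/")    # placeholder tile entity namespace
REC = Namespace("http://example.org/record/")   # placeholder record entity namespace
GEO = Namespace("http://www.opengis.net/ont/geosparql#")

g = Graph()
for prefix, ns in [("nade", NADE), ("tile", TILE), ("rec", REC), ("geo", GEO)]:
    g.bind(prefix, ns)

# Tile triple: adjacency between neighbouring tiles
g.add((TILE["831878_410956"], NADE.hasAdjacentTile, TILE["831878_410957"]))

# Record triples: a landslide event and its GeoSPARQL point geometry
g.add((REC["disaster_record_6b65e22d"], NADE.hasDisasterType, NADE.landslide))
g.add((REC["disaster_record_6b65e22d_geom"], RDF.type, GEO.Geometry))
g.add((REC["disaster_record_6b65e22d_geom"], GEO.asWKT,
       Literal("POINT(105.7 35.9)", datatype=GEO.wktLiteral)))

# Factor triples: a discrete value directly, and a discretized continuous value
g.add((TILE["831878_410956"], NADE.hasFactorPropertyType, NADE.hidden_fault))
g.add((TILE["831878_410956"], NADE.hasFactorPropertyType, NADE.altitude_range_3))

print(g.serialize(format="turtle"))
```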
In the prediction phase, the entities and relations to be forecasted within the knowledge graph are also represented as triples in the data layer. For instance, consider the triple (disaster_task_e3a9851a, nade:hasLevel2, susceptibility), where ‘disaster_task_e3a9851a’ denotes the landslide tasks to be predicted, ‘susceptibility’ denotes the susceptibility level, and ‘nade:hasLevel2’ denotes that the task with the ID ‘e3a9851a’ is associated with the susceptibility level 2. Each landslide record and task possesses a unique ID linked to an individual tile. Since tiles serve as the mapping units for landslide prediction, the core of knowledge graph-based landslide susceptibility prediction lies in forecasting the susceptibility level of each tile’s associated landslide task.
After creating the triples in the data layer, these triples are mapped to the schema layer to generate the knowledge graph.
Figure 9 illustrates the connection between the data layer and schema layer. This connection primarily involves mapping tile triples to the NADE-Environment (i.e., factor triples), mapping record triples to NADE-Core, mapping record triples to GeoSPARQL, and particularly, establishing links between task triples and NADE-Task during the prediction phase.
3.3. Knowledge Graph Embedding
KGE is a representation learning technique that aims to map information about entities, relations, and attributes in a knowledge graph into a continuous vector space to better capture implicit relations between entities. KGE can measure the semantic similarity between entities and the semantic associations of relations in a vector space. For instance, if two entities share similar relations in the knowledge graph, their vector representations will be closer in the embedded vector space. KGE models typically learn the vector representations of entities and relations by minimizing or maximizing a loss function, preserving semantic relations between entities in the knowledge graph within the embedded vector space.
3.3.1. Task Formalization
In this paper, we transform the landslide prediction task into a knowledge graph-based link prediction task. Link prediction involves predicting potential connections within a knowledge graph by analyzing the information associated with existing nodes and edges. It aims to determine the probable relation, denoted by r, between a given pair of entities, represented as (h, t):

$$\mathrm{score}(h, r, t) = f(\mathbf{h}, \mathbf{r}, \mathbf{t})$$

where the vectors of entities h and t are denoted by $\mathbf{h}$ and $\mathbf{t}$, respectively, and the vector of relation r is denoted by $\mathbf{r}$. The value $f(\mathbf{h}, \mathbf{r}, \mathbf{t})$ denotes the score of the relation r between the entity pair (h, t); the larger this value, the more likely it is that the relation r exists between the entity pair. The function f maps entity vectors and relation vectors to scores, and its exact form can be chosen according to the KGE model, e.g., using the $L_{1}$ or $L_{2}$ norm [39].
The aim of link prediction is to forecast new relations that may exist within the knowledge graph but have not been revealed yet, based on known entities and relations. In the context of landslide prediction, this entails forecasting the level of the landslide task indicator. For instance, when predicting landslide susceptibility for a grid cell with task number ‘e3a9851a’, the objective of link prediction is to anticipate which level of susceptibility is more likely to be associated with a pair of entities: a head entity of ‘disaster_task_e3a9851a’ and a tail entity of ‘susceptibility’.
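For illustration, the sketch below scores the candidate susceptibility-level relations for one task entity with a TransE-style score; the embeddings are random placeholders standing in for vectors learned by the KGE model, and the relation names follow the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Placeholder embeddings; in practice these come from the trained KGE model
entity = {name: rng.normal(size=dim) for name in ["disaster_task_e3a9851a", "susceptibility"]}
relation = {f"nade:hasLevel{k}": rng.normal(size=dim) for k in range(1, 6)}

def score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """TransE-style score -||h + r - t||_1: larger values mean the triple is more plausible."""
    return -float(np.linalg.norm(h + r - t, ord=1))

h = entity["disaster_task_e3a9851a"]
t = entity["susceptibility"]
scores = {name: score(h, r, t) for name, r in relation.items()}
print(max(scores, key=scores.get))  # susceptibility level predicted for this task/tile
```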
3.3.2. Model Training
The training of the KGE model first involves the triples that are present in the knowledge graph, i.e., positive samples. The goal of the model is to embed these positive samples into the vector space in order to capture the semantic information between entities and relations through vector operations. On the other hand, negative samples need to be introduced to train the KGE model and enhance its ability to recognize triples that do not exist in the knowledge graph. For each positive sample (h, r, t), the head entity h or the tail entity t is randomly selected and replaced by an irrelevant head entity h′ or tail entity t′, thus generating a negative sample (h′, r, t) or (h, r, t′). The positive and negative samples are used together to train the KGE model, with the training objective being to minimize the embedding distance of the positive samples while maximizing the embedding distance of the negative samples. This process enhances the model’s ability to understand and represent the knowledge graph.
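A minimal sketch of negative sampling and a margin-based ranking objective of the kind described here is shown below; the toy vocabulary and dummy scores are placeholders, and the actual score function depends on the KGE model being trained.

```python
import random

entities = ["tile_a", "tile_b", "hidden_fault", "altitude_range_3"]
positives = [("tile_a", "nade:hasAdjacentTile", "tile_b"),
             ("tile_a", "nade:hasFactorPropertyType", "hidden_fault")]

def corrupt(triple, entities):
    """Replace the head or the tail with another entity to build a negative triple."""
    h, r, t = triple
    if random.random() < 0.5:
        return (random.choice([e for e in entities if e != h]), r, t)
    return (h, r, random.choice([e for e in entities if e != t]))

def margin_loss(pos_score: float, neg_score: float, gamma: float = 1.0) -> float:
    """Hinge loss: penalize negatives whose score comes within gamma of the positive's score."""
    return max(0.0, gamma - pos_score + neg_score)

# Dummy scores stand in for f(h, r, t) computed from the embeddings being learned
neg = corrupt(positives[0], entities)
print(neg, margin_loss(pos_score=-2.3, neg_score=-4.1))
```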
In this paper, five typical KGE models are used for training and their score functions are shown in Table 3. The loss function used in training is defined as follows:

$$\mathcal{L} = \sum_{(h, r, t) \in S} \sum_{(h', r, t') \in S'} \max\left(0,\ \gamma - f(\mathbf{h}, \mathbf{r}, \mathbf{t}) + f(\mathbf{h}', \mathbf{r}, \mathbf{t}')\right)$$

where $\mathcal{L}$ denotes the loss function, which is the objective function we aim to minimize during training. The triple (h, r, t) denotes a positive triple, and (h′, r, t′) denotes a negative triple. S and S′ denote the sets of positive and negative triples, respectively. $\gamma$ denotes the margin used to ensure a minimum score difference between positive and negative triples.
3.3.3. Prediction
In this paper, landslide susceptibility assessment is employed as an illustrative example for landslide prediction. Initially, the KGE-based link prediction model generates scores for various landslide susceptibility levels on each grid cell, representing the likelihood of each grid cell belonging to different susceptibility classes. Subsequently, the highest scoring susceptibility class is selected for each grid cell, thereby determining the susceptibility result for that particular location. Finally, the susceptibility results for each grid cell are aggregated to produce a susceptibility map for the entire study area.
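As a sketch of this final stage, the hypothetical example below takes a matrix of link prediction scores (one row per tile, one column per susceptibility level), selects the highest-scoring level for each tile, and assembles the results; the scores, tile coordinates, and level labels are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
tiles = [(207900 + i, 103017 + j) for i in range(3) for j in range(2)]
levels = ["very_low", "low", "moderate", "high", "very_high"]

# One row of link prediction scores per tile; in practice these come from the KGE model
scores = rng.normal(size=(len(tiles), len(levels)))
best = scores.argmax(axis=1)  # highest-scoring susceptibility class per tile

susceptibility_map = {tile: levels[k] for tile, k in zip(tiles, best)}
print(susceptibility_map)
```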
5. Discussion
In our experiments, we compare a general landslide prediction system with our prediction system to demonstrate the feasibility and advantages of modeling landslide scenarios based on a knowledge graph. On one hand, the experimental results show that knowledge graph-based modeling of landslide scenarios is more useful for discovering spatial patterns in the landslide process than the traditional modeling method based on the “sequence” structure. This benefit stems from the nature of the knowledge graph in the process of data organization and representation, which can convey semantic information during model training and thus enhance model performance. Additionally, based on the results of the entity similarity comparison, we found that the KGE model can indeed learn logical entity embedding representations, with semantically similar entities lying close together in the vector space. This contributes to correct results in the link prediction process. Similarly, the susceptibility map generated using the KGE model demonstrates this semantic representation capability. Moreover, the advantages of KGE models become more significant when the dataset is sparse. Data-driven models usually require substantial data for training and prediction; with sparse data, it is challenging for these models to effectively learn and generalize. The advantage of KGE models lies in the mapping of entities and relations to the embedding space, which assists the models in inferring missing values and gaining a deeper understanding of the underlying relationships in the data.

It is worth noting that the tile level also affects the performance of the model. A lower tile level can enhance the training and prediction speed of the model; however, due to the coarser resolution, each tile then contains too many LCF values, preventing the model from fully learning the relationship between LCFs and grid cells and ultimately reducing the model’s performance. Conversely, if the tile level is too high, it significantly decreases the training and prediction speed of the model and may lead to an overly sparse knowledge graph, hindering the model’s ability to effectively learn features. Therefore, selecting an appropriate tile level is crucial.
On the other hand, predicting landslides using KGE models is a novel and comprehensive end-to-end method. General data-driven methods typically involve manually selecting, designing, and extracting environmental features, and then using those features to train a model for a prediction task. These methods typically require multiple steps, including data preprocessing and LCF analysis and selection. These steps often necessitate the involvement of domain experts and multiple individual modules. For KGE models, in contrast, the data preprocessing step is performed only once when constructing the knowledge graph, enhancing data reusability. Moreover, KGE models generate predictions directly from the embedding space, eliminating the need for manual feature selection or multiple preprocessing steps. Thus, using the KGE model to predict landslides reduces the complexity of manual intervention and engineering design, making the model more easily scalable to other hazard tasks. Based on our experiments, we believe that this method is promising.
However, there are some limitations to our method that are worth noting. When data are complete, the advantage of the KGE method over general machine learning methods is not very significant, although it exhibits a slight edge in prediction performance. This is mainly due to the sparsity of the structure of the constructed knowledge graph and the inherent limitations of the KGE model’s learning capabilities. Additionally, data-driven landslide prediction relies on negative samples that are not guaranteed to come from truly non-landslide areas, so errors may be introduced during sample production and subsequently affect the quality of the test set. Furthermore, in terms of the training details of the KGE model, the way in which the head entity and tail entity are replaced when generating negative triples can also affect the model’s performance.
6. Conclusions
Data-driven methods typically simplify landslide scenarios during modeling, resulting in information loss during the prediction process. To address this challenge, this paper presents a novel approach to landslide prediction. We represent complex disaster scenarios by designing the schema layer of the knowledge graph, and we organize the multi-source heterogeneous disaster data into triples by constructing the data layer and mapping it to the schema layer. Subsequently, landslide prediction is conducted using the KGE model. The experimental results demonstrate the capability of knowledge graphs to model complex disaster scenarios, addressing the information loss that data-driven approaches incur during modeling. The primary contributions of this paper can be summarized as follows:
For the first time, a knowledge graph embedding method is applied to landslide prediction, resulting in a performance improvement, marking an innovative approach in this field.
With the assistance of a graph-based modeling method, we improve the exploration of spatial information within landslide scenarios.
We introduced a novel end-to-end assessment method for the precise evaluation of landslide susceptibility, which holds extensive applicability.
Our method empowers effective landslide prediction even with limited data, offering support for applications in resource-constrained environments.
In future research, a primary focus will be on refining the precision of landslide prediction, encompassing areas with sufficient sample data and those lacking historical landslide records, as experimental results suggest promising potential in both scenarios. Addressing the prediction bias stemming from the sparse knowledge graph structure will entail an exploration of a more sophisticated schema layer. Concurrently, improvements to the model structure will be actively pursued, including the integration of graph neural network models to capture higher-order interactions among entities and relationships. This strategic enhancement aims to enhance the predictive capabilities of the model. Furthermore, endeavors will be directed towards expanding the scope of downstream tasks to include other disaster types, thereby augmenting the method’s versatility and the utility of disaster data.