1. Introduction
A knowledge graph [1] represents a knowledge base as a directed graph in which nodes and edges correspond to entities and to the relationships between entities, respectively. Knowledge graphs play an important role in practical applications such as semantic search [2], dialog generation [3], question answering (Q&A) [4], and recommendation algorithms [5]. However, knowledge graphs often suffer from missing relations, i.e., relations that hold in reality but are absent from the graph. To address this problem, the knowledge base completion task [6], also known as link prediction, has emerged to predict whether a given triple, e.g., (Yao Ming, gender, man), is valid.
Currently, models based on knowledge embedding and on convolutional neural networks have achieved satisfactory results in link prediction. However, existing knowledge embedding methods [7] do not take into account the subtle links between entities and relations, even though they contain high-quality multi-hop triple information. Multi-hop triple information introduces intermediate entities and relations along the paths between a head entity and a tail entity; these additional entities and relations in turn form new triples. In addition, existing convolutional neural network models mainly perform convolution on a triple directly while ignoring the sequential order of the entities and the relation within the triple.
Based on the above observations, this paper proposes a link prediction model based on feature mapping and bidirectional convolution. (1) First, the multi-hop triple information is fed into a graph attention network to encode the entities and relations. (2) After encoding, each entity is mapped into the space of its relation encoding to obtain an entity representation in the relation space, which captures the links between entities and relations. (3) Given the mapped entity–relation encodings, each triple is assembled in two orders, forward and reverse; both encoded sequences are fed into a convolutional neural network, which is trained with the R-Drop method. After the convolutional network extracts information from the bidirectionally combined triples, the model is better positioned to predict both head and tail entities. The architecture consists of an encoding layer, a mapping layer, and a decoding layer, realized by the multi-hop graph attention encoder, the entity–relation transformation layer, and the bidirectional convolutional layer, respectively.
The contributions of this paper are as follows. To the best of our knowledge, this paper is the first to introduce a mapping between entities and relations into a convolutional neural network specifically for link prediction on knowledge graphs; this mapping mechanism is the main innovation of this paper. Second, we extend the convolutional neural network with bidirectional convolution to extract information from triples; this bidirectional convolution is important for the link prediction task. Finally, we evaluated our model in detail on two datasets, FB15k-237 and WN18RR. The results of the corresponding experiments demonstrate that the proposed method significantly outperforms the baseline models.
2. Related Work
Recent link prediction methods can be broadly categorized as follows: translation-based models, multiplication-based models, CNN-based models, rotation-based models, GNN-based models, and others.
2.1. Translation-Based Models
Bordes et al. [8] proposed the TransE model, which interprets a triple as a translation of the head entity to the tail entity over low-dimensional embeddings. Although TransE achieves good results, it cannot handle complex relations such as reflexive, one-to-many, many-to-one, and many-to-many cases. For this reason, Wang et al. [9] proposed the TransH model, which models a relation as a hyperplane and performs the translation on that hyperplane, thereby preserving the mapping properties of the relation. However, an entity may have more than one aspect, and different relations may focus on different aspects of the entity, so Lin et al. [10] proposed the TransR model, which builds entity and relation embeddings in separate entity and relation spaces and projects entities into the relation space before modeling the translation. Ji et al. [11] proposed the finer-grained TransD model, which uses two vectors to represent each entity and relation, accounting for the diversity of both relations and entities. Fan et al. [12] proposed the TransM model, which uses the structure of the knowledge graph to pre-compute a weight for each training triple based on its relation mapping attributes; the optimization then treats each triple according to its weight, improving the overall performance of the model. Zhang et al. [13] proposed the TransAP model, introducing location-aware embeddings and self-attention blocks that proved capable of adapting to various translational distance models. Lu et al. [7] applied PTransE and AMIE+ to extract multi-hop and rule-based relations.
2.2. Multiplication-Based Models
Nickel et al. [14] proposed the RESCAL model, a tensor factorization-based relational learning method that collectively learns latent components to improve the ability to score triples. Yang et al. [15] proposed the DistMult model, which captures relational semantics with a bilinear learning objective. Trouillon et al. [16] used complex-valued embeddings and proposed the ComplEx model, in which the composition of complex-valued embeddings can handle a large variety of binary relations, including symmetric and antisymmetric relations. Kazemi et al. [17] proposed the SimplE model, which allows the two embeddings of each entity to be learned independently, with a complexity that grows linearly with the embedding size. Balažević et al. [18] proposed the TuckER model, which uses a purely linear model based on Tucker decomposition to address the triple prediction problem and achieved relatively good results.
2.3. CNN-Based Models
Dettmers et al. [19] proposed the ConvE model, a multi-layer convolutional network and the first attempt to address the link prediction problem with convolutional neural networks. Subsequently, Nguyen et al. [20] proposed the ConvKB model, which improves performance by using a convolutional neural network to capture global relationships and transition features between the entities and relations in the knowledge base. Vu et al. [21] proposed the CapsE embedding model, which employs a capsule network to model the triple: multiple filters generate an intermediate hidden layer from which continuous vectors are produced, and this two-stage design improves the performance of the model. Balažević et al. [22] proposed HypER, a hypernetwork architecture that generates simplified relation-specific convolutional filters for relation prediction and also achieved relatively good results. Vashishth et al. [23] proposed the InteractE model, which employs feature permutation, a novel feature reshaping, and circular convolution to improve model performance.
2.4. Rotation-Based Models
Sun et al. [24] proposed a new knowledge graph embedding method, the RotatE model, in which each relation is defined as a rotation from the source entity to the target entity in a complex vector space; it can infer and model a variety of relational patterns, and embedding in the complex space also leads to a significant improvement in performance. Zhang et al. [25] proposed the QuatE model, which goes beyond the traditional complex-valued representation by introducing a more expressive hypercomplex representation for entities and relations, with relations modeled as rotations in quaternion space, likewise improving performance. Gao et al. [26] proposed the Rotate3D model, which maps entities into a three-dimensional space and defines relations as rotations from head entities to tail entities, exploiting the non-commutativity of composition to naturally preserve the order of relation composition. Chen et al. [27] proposed the HittER model, which consists of two different Transformer blocks: the bottom block extracts the features of each entity–relation pair in the local neighborhood of the source entity, and the top block aggregates the relational information output by the bottom block; the introduction of the Transformer also improved performance. Abboud et al. [28] proposed BoxE, a spatio-translational embedding model that embeds entities and relations as hyper-rectangles (boxes); these boxes spatially capture basic logical properties, the model can express information from a rich rule language, and this global reasoning capability also improves performance.
2.5. GNN-Based Models
Schlichtkrull et al. [29] introduced relational graph convolutional networks for link prediction, developed specifically to handle the multi-relational nature of realistic knowledge bases. Li et al. [30] fused graph neural networks with a multi-layer perceptron, also with good results. Tan et al. [31] proposed the knowledge relational attention network (KRAT), which uses the graph context to project neighboring triples into different latent spaces simultaneously, capturing the subtle semantic information and the importance of different contextual triples. Liang et al. [32] aggregated multimodal and graph-structural information to generate the final embedding for knowledge graph completion. Niu et al. [33] introduced graph convolutional networks (GCNs) into channel attention; the convolution encodes the node structural features and represents them in matrix form to fully exploit the node feature information. Zamini et al. [34] learned entities and relations by embedding them in a low-dimensional vector space with a graph neural network.
2.6. Other Models
Zhang et al. [35] proposed the DURA regularization method, which targets existing tensor decomposition-based models and regularizes them, improving their performance. Nathani et al. [36] proposed an attention-based feature embedding method that captures entity and relation features in the neighborhood of any given entity, with significant performance improvement. Chen et al. [37] proposed MetaR, a meta-relational learning framework for few-shot link prediction. Baek et al. [38] proposed GEN, a meta-learning framework for the same problem; the model performs inductive inference for the node embedding network and transductive inference for the link prediction network, and by modeling the uncertainty of unseen entities it also improves accuracy. Bose et al. [39] proposed a meta-learning approach for few-shot link prediction in which, in addition to the global parameters, the model learns a graph signature function that exploits the structural information of a graph relative to other graphs from the same distribution, resulting in faster adaptation and better convergence than ordinary meta-learning. Xu et al. [40] proposed a cross-modal attention mechanism that provides bidirectional attention over visual and textual features by integrating relational embeddings, thus improving the robustness of the model. Zhang et al. [41] introduced federated learning to the task of knowledge graph completion. Shomer et al. [42] introduced a data augmentation approach for knowledge graph completion. Luo et al. [43] applied few-shot learning to knowledge graph completion. Wang et al. [44] proposed a multi-concept representation learning approach in which each entity or relation is projected to multiple vectors to capture the complex conceptual information hidden in them for better embeddings.
3. Model Details
The model in this paper consists of three layers. First is an encoding layer that extracts encoded representations of entities and relations from the input data; its goal is to capture the important features of the input. Second is a mapping layer, through which entities are mapped into the relation space; its role is to model and represent the associations of entities across relations. Finally, a decoding layer scores the triples; it recombines the encoded entities and relations into triple form for the link prediction task.
The encoding layer, as shown in Figure 1a, fuses the single-hop triple information with the multi-hop triple information, and the fused information is passed through the graph neural network to obtain the entity and relation encodings. Because the entity and relation encodings produced by the encoding layer lie in different spaces, the mapping layer, as shown in Figure 1b, maps the entity encoding into the relation encoding space to obtain a new entity encoding, which strengthens the subtle connection between entities and relations. The decoding layer, as shown in Figure 1c, strengthens the predictive capability of the model by combining the forward-order and reverse-order triple information.
3.1. Encoding Layer
In order to obtain the embeddings of entities and relations, it is necessary to learn a representation of each triple. The entity and relation feature vectors of the single-hop triples and of the multi-hop triples are concatenated to obtain a richer triple representation; multi-hop triples carry additional information between entities and relations, which makes the embedding richer. In the notation below, t denotes a triple, s denotes the single-hop information, and d denotes the multi-hop information, with the remaining symbols denoting the entity and relation embeddings. First, the single-hop and multi-hop triples are merged as follows.
The linear transformation is then performed with the following equation, where the result is the vector representation of the triple after the linear transformation, and the three embedding matrices represent the embeddings of the head entities, the tail entities, and the relations, respectively.
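As an illustrative sketch (not the paper's exact implementation), the following PyTorch fragment shows how the single-hop and multi-hop triple features could be concatenated and passed through the linear transformation; the class name, tensor shapes, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TripleRepresentation(nn.Module):
    """Sketch: concatenate single-hop and multi-hop triple features
    and apply a shared linear transformation."""

    def __init__(self, ent_dim: int, rel_dim: int, out_dim: int):
        super().__init__()
        # A single-hop triple is (head, relation, tail); the multi-hop path
        # contributes another (head, auxiliary relation, tail) feature vector.
        in_dim = 2 * (2 * ent_dim + rel_dim)
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h_s, r_s, t_s, h_d, r_d, t_d):
        single_hop = torch.cat([h_s, r_s, t_s], dim=-1)  # single-hop features
        multi_hop = torch.cat([h_d, r_d, t_d], dim=-1)   # multi-hop features
        merged = torch.cat([single_hop, multi_hop], dim=-1)
        return self.W1(merged)                           # transformed triple vector
```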
We then perform a second linear transformation, parameterized by another weight matrix, and apply the LeakyReLU activation function to obtain the absolute attention value of each triple, as given by the following formula.
To obtain the relative attention values, the softmax function is applied over the absolute attention values of each neighborhood, as shown in the following equation.
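A minimal PyTorch sketch of these two attention steps is given below; the shape of the second weight matrix and the grouping of triples by their head entity are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triple_attention(c, head_index, num_entities, W2):
    """Sketch: c is (num_triples, dim), the transformed triple vectors;
    head_index[i] is the entity whose neighborhood triple i belongs to;
    W2 has shape (dim, 1)."""
    # Absolute attention value: LeakyReLU of a second linear transformation
    b = F.leaky_relu(c @ W2).squeeze(-1)                 # (num_triples,)
    # Relative attention: softmax over each entity's neighborhood
    b_exp = torch.exp(b - b.max())                       # shift for numerical stability
    denom = torch.zeros(num_entities, device=c.device).index_add_(0, head_index, b_exp)
    alpha = b_exp / denom[head_index]                    # (num_triples,)
    return alpha
```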
The new embedding of an entity is then the sum of the representations of the triples in its neighborhood, weighted by their attention values, as shown in the following equation, where the sums run over the neighborhood of the entity and over the set of relations connecting the entity to each neighbor. To stabilize the learning process and encapsulate more information about the neighborhood, multi-head attention is used: each head computes an embedding, and the embeddings are concatenated, as shown in the following equation.
In the last layer of the model, rather than concatenating the embeddings of the multiple heads, an averaging method is used to obtain the final embedding vector of the entity, as shown in the following equation.
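The aggregation step can be sketched as follows; the per-head bookkeeping, the index-based scatter-add, and the `final_layer` switch are assumptions made for illustration rather than the paper's exact code.

```python
import torch

def aggregate_entity_embeddings(alpha, c, head_index, num_entities, final_layer=False):
    """Sketch: attention-weighted sum of triple vectors per entity, for several heads.
    `alpha` is a list of per-head attention weights, `c` a list of per-head triple vectors."""
    heads = []
    for a, ch in zip(alpha, c):
        # Sum of attention-weighted triple representations over each neighborhood
        agg = torch.zeros(num_entities, ch.size(-1), device=ch.device)
        agg.index_add_(0, head_index, a.unsqueeze(-1) * ch)
        heads.append(agg)
    if final_layer:
        # Last layer: average the heads instead of concatenating them
        return torch.stack(heads, dim=0).mean(dim=0)
    return torch.cat(heads, dim=-1)  # intermediate layers: concatenate heads
```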
However, when learning new embeddings, entities lose their initial embedding information. To solve this problem, we apply a weight matrix to linearly transform the input entity embeddings of the model and add the transformed initial embeddings to the final encoding, so that the output retains the entity information of the initial embeddings, as shown in the following equation.
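A sketch of this skip connection, under the assumption that a single learned weight matrix aligns the initial embedding dimension with the final one, is shown below.

```python
import torch.nn as nn

class InitialEmbeddingResidual(nn.Module):
    """Sketch: linearly transform the input entity embeddings and add them to the
    final attention output so the initial information is not lost."""

    def __init__(self, init_dim: int, final_dim: int):
        super().__init__()
        self.We = nn.Linear(init_dim, final_dim, bias=False)  # assumed weight matrix

    def forward(self, h_init, h_attn):
        # h_init: initial entity embeddings; h_attn: output of the attention layers
        return self.We(h_init) + h_attn
```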
The encoder is trained with a hinge loss, given by the following expression, where the margin is a hyperparameter and the loss contrasts the set S of valid triples with a corresponding set of invalid triples. In our architecture, we extend edges to directed paths by introducing an auxiliary relation between the two endpoint entities of each path; the embedding of this auxiliary relation is the sum of all relation embeddings along the path. In this way, the model iteratively accumulates knowledge from the multi-hop neighborhood of each entity.
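A minimal sketch of such a margin-based hinge loss is shown below; the distance-based scoring convention, the margin value, and the one-to-one pairing of valid and invalid triples are assumptions.

```python
import torch

def hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Sketch of the margin-based hinge loss used to train the encoder.
    pos_scores / neg_scores are distances for valid / invalid triples;
    valid triples should score lower than invalid ones by at least `margin`."""
    return torch.clamp(margin + pos_scores - neg_scores, min=0.0).sum()
```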
3.2. Mapping Layer
For each relation r, we place the relation-specific translation vector in a relation-specific hyperplane rather than in the same space in which the entities are embedded, so that the entity and relation distributions use different coding spaces; the projections are then used to match relations and entities, allowing a more diversified encoding of both. Specifically, for a triple (h, r, t), the embedded head entity h and tail entity t are first projected onto the hyperplane of relation r, yielding projected representations of h and t. If (h, r, t) is a correct triple, the two projections are expected to be connected by the relation-specific translation vector within the hyperplane. The projection function is shown below.
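The projection can be sketched as a TransH-style operation; the use of a unit normal vector per relation is an assumption consistent with the hyperplane description above, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def project_to_relation_hyperplane(e, w_r):
    """Sketch: project entity embedding `e` onto the hyperplane of relation r,
    whose (assumed) normal vector is `w_r`."""
    w = F.normalize(w_r, p=2, dim=-1)                   # unit normal of the hyperplane
    return e - (e * w).sum(dim=-1, keepdim=True) * w    # e_perp = e - (w^T e) w

# Usage: a triple (h, r, t) is plausible if h_perp + d_r is close to t_perp on the
# hyperplane, where d_r is the relation-specific translation vector.
```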
3.3. Decoding Layer
In our experiments, we employ bidirectional convolution as the decoder, aiming to exploit both the global embedding features and the bidirectional encoding of each triple. The bidirectional encoding, in which each triple is presented in both forward and reverse order, gives the model a stronger basis for predicting both head and tail entities. The subsequent experimental results demonstrate that bidirectional encoding improves the performance of the model; the scoring function of the decoding layer is given by the following equation.
where the score is computed with a set of convolution filters whose number is a hyperparameter, ∗ denotes the convolution operation, and W denotes the linear transformation matrix used to compute the final score of the triple.
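One possible reading of this bidirectional convolutional scoring is sketched below in the style of ConvKB; the kernel shape, the number of filters, and the summation of the forward and reverse scores are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BiConvScorer(nn.Module):
    """Sketch of a ConvKB-style decoder that scores each triple in both orders."""

    def __init__(self, dim: int, num_filters: int = 64):
        super().__init__()
        # Each triple is stacked as a dim x 3 matrix and convolved row by row
        self.conv = nn.Conv2d(1, num_filters, kernel_size=(1, 3))
        self.W = nn.Linear(num_filters * dim, 1, bias=False)

    def score_one_direction(self, a, r, b):
        x = torch.stack([a, r, b], dim=-1).unsqueeze(1)  # (batch, 1, dim, 3)
        feat = torch.relu(self.conv(x)).flatten(1)       # (batch, num_filters * dim)
        return self.W(feat).squeeze(-1)

    def forward(self, h, r, t):
        forward_score = self.score_one_direction(h, r, t)  # forward order (h, r, t)
        reverse_score = self.score_one_direction(t, r, h)  # reverse order (t, r, h)
        return forward_score + reverse_score               # combined bidirectional score
```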
The R-Drop training method is used while decoding: because of the randomness of dropout, two forward passes of the same input through the same model produce slightly different outputs. R-Drop therefore passes each mini-batch through the network twice to obtain two different output distributions, and the KL divergence between these two distributions is minimized during training, which improves the robustness of the model. The loss function used in the model is shown below.
where L is the loss function of the decoding layer; L consists of a softplus term and an L2 regularization term whose coefficient is a hyperparameter; and S is the set of positive triples, with the corresponding corrupted set providing the negative triples.
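The combined objective can be sketched as follows; the KL weight, the regularization coefficient, and the use of a sigmoid to turn scores into two-class distributions are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def rdrop_loss(model, h, r, t, labels, alpha=1.0, lam=1e-5):
    """Sketch of an R-Drop style training objective for the decoder.
    `labels` is +1 for positive triples and -1 for negative ones."""
    score1 = model(h, r, t)          # first forward pass (dropout active)
    score2 = model(h, r, t)          # second forward pass, different dropout mask

    # Softplus ranking loss, averaged over the two passes
    task_loss = (F.softplus(-labels * score1).mean() +
                 F.softplus(-labels * score2).mean()) / 2

    # Symmetric KL divergence between the two predicted distributions
    p1, p2 = torch.sigmoid(score1), torch.sigmoid(score2)
    dist1 = torch.stack([p1, 1 - p1], dim=-1).clamp_min(1e-7)
    dist2 = torch.stack([p2, 1 - p2], dim=-1).clamp_min(1e-7)
    kl = (F.kl_div(dist1.log(), dist2, reduction="batchmean") +
          F.kl_div(dist2.log(), dist1, reduction="batchmean")) / 2

    # L2 regularization over all decoder parameters
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return task_loss + alpha * kl + lam * l2
```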
4. Experiment and Analysis
4.1. Dataset
We evaluated the proposed method on two benchmark datasets, WN18RR and FB15k-237. WN18RR is a commonly used knowledge graph link prediction dataset for evaluating and comparing link prediction models. It is derived from the WN18 dataset, a subset of WordNet representing lexical relationships between words, by removing the inverse relations that made many test triples trivially predictable; it contains 40,943 entities and 11 relations. The FB15k-237 dataset was likewise obtained by pruning the original FB15k dataset, removing near-duplicate and inverse relations that were relatively easy to predict. This increases the difficulty of the dataset and makes it a more accurate assessment of model performance on complex and rare relations; it contains 14,541 entities and 237 relations. Further statistics are shown in Table 1.
4.2. Training
For training, we augmented the data by creating two sets of invalid triples: one set replaces the head entity of each triple with another entity, and the other set replaces the tail entity. We randomly drew the same number of invalid triples from each of these two sets to ensure robust performance in predicting both head and tail entities. The initial entity and relation embeddings were generated with TransE.
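A simple sketch of this negative-sampling scheme is given below; the filtering of accidentally valid triples and the even split between head- and tail-corrupted samples are assumptions consistent with the description above.

```python
import random

def corrupt_triples(triples, entities, num_negatives):
    """Sketch: build two pools of invalid triples (corrupted heads and corrupted
    tails) and draw an equal number from each."""
    known = set(triples)
    head_corrupted, tail_corrupted = [], []
    for h, r, t in triples:
        h_new = random.choice(entities)
        if (h_new, r, t) not in known:
            head_corrupted.append((h_new, r, t))
        t_new = random.choice(entities)
        if (h, r, t_new) not in known:
            tail_corrupted.append((h, r, t_new))
    k = num_negatives // 2
    return (random.sample(head_corrupted, min(k, len(head_corrupted))) +
            random.sample(tail_corrupted, min(k, len(tail_corrupted))))
```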
The experiment adopted the training process of encoder–mapper–decoder. Firstly, the multi-hop triple information was fed into the graph attention network for encoding, and the encoding of entities and relations of the knowledge graph was obtained. Then, the entities were mapped according to the space of relations, and the mapped entity–relationship encoding was obtained. Finally, the bi-directional convolutional decoder was trained to perform the relation prediction task.
Different parameter settings were used for the two datasets. (1) For the WN18RR dataset, all parameters were optimized with Adam; the encoding layer was trained for 3600 epochs with an initial learning rate of 5 × , the decoding layer was trained for 200 epochs with an initial learning rate of 1 × , the number of filters in the decoding layer was 64, and the entity and relation embeddings of the last layer were both set to 200 dimensions. (2) For the FB15k-237 dataset, all parameters were also optimized with Adam; the encoding layer was trained for 3000 epochs with an initial learning rate of 0.00001, the decoding layer was trained for 200 epochs with an initial learning rate of 0.000001, the number of filters in the decoding layer was 64, and the entity and relation embeddings of the last layer were both set to 200 dimensions.
4.3. Validation
In the link prediction task, the goal is to complete a triple; that is, to predict the tail entity given the head entity and the relation, or to predict the head entity given the tail entity and the relation. For each test triple, we generated a set of (N − 1) corrupted triples by replacing the target entity with every other entity in turn, and assigned a score to each such triple. We then sorted these scores to obtain the rank of the correct triple. The experiments used the filtered setting, i.e., when ranking, we removed any corrupted triples that already exist in the training, validation, or test set. The whole process was repeated for both head and tail replacement, and the average metrics are reported. We report the mean reciprocal rank (MRR), mean rank (MR), and the hit rate Hits@N for N = 1, 3, and 10, i.e., the proportion of correct triples ranked within the top 1, 3, and 10.
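The filtered evaluation protocol can be sketched as follows for tail prediction (head prediction is symmetric); the `score_fn` interface and the higher-is-better scoring convention are assumptions.

```python
def filtered_ranking_metrics(test_triples, all_entities, known_triples, score_fn):
    """Sketch of filtered evaluation: rank the true tail against all candidate
    tails that do not form a known triple elsewhere in the data."""
    ranks = []
    for h, r, t in test_triples:
        candidates = [e for e in all_entities
                      if e == t or (h, r, e) not in known_triples]
        scores = {e: score_fn(h, r, e) for e in candidates}
        rank = 1 + sum(1 for e, s in scores.items() if s > scores[t])
        ranks.append(rank)
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Hits@1": sum(r <= 1 for r in ranks) / n,
        "Hits@3": sum(r <= 3 for r in ranks) / n,
        "Hits@10": sum(r <= 10 for r in ranks) / n,
    }
```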
4.4. Model Comparison
The proposed model, denoted ConvFM, was compared with the following existing methods: TransE, ConvE, ConvKB, DistMult, ComplEx, and KB-GAT.
4.4.1. Comparison with the Translation Model Approach
TransE models the entities and relations of a triple by embedding them in a low-dimensional space, but the initial embedding information of the translation model is not rich enough, and there is no mapping of entities into the relation space to capture the connections between entities and relations. In contrast, the ConvFM model both maps entities into the relation space and uses a richer encoding of entities and relations.
4.4.2. Comparison with Multiplication Methods
DistMult uses a bilinear learning objective to capture relational semantics, and ComplEx models entities and relations using complex-valued embeddings, but tensor decomposition models are generally prone to overfitting and their overall results are weaker. The ConvFM model, in contrast, is based on convolutional neural networks and achieves better overall results.
4.4.3. Comparison with CNN Methods
ConvE uses a multi-layer convolutional network for link prediction, and ConvKB uses a convolutional neural network to capture the global relationships and transition features between entities and relations in the knowledge base, but neither of these models takes into account the ordering of the elements within a triple. The ConvFM model, on the other hand, considers both the forward and reverse order of each triple, which improves its performance.
4.4.4. Comparison with the KB-GAT Model
Although the KB-GAT model produces high-quality knowledge embeddings and uses a convolutional neural network as its decoder, it does not consider the mapping between entities and relations and therefore ignores the connections between them. The ConvFM model, on the other hand, obtains better results by mapping entities into the relation space and capturing the subtle connections between entities and relations.
The prediction results of the model for both datasets are displayed in Table 2. These results clearly show that our proposed method achieved essentially the best results on both datasets, FB15k-237 and WN18RR.
5. Ablation Experiments
Given the strong experimental results of the proposed method, a closer examination of the specific contributions of its components is warranted. The multi-hop graph attention knowledge embedding and one-way convolution were used as the base model, and bidirectional convolution, feature mapping, and R-Drop were used as variables to set up controlled experiments.
First, all triples were passed through the encoding layer to obtain the encoded representations of entities and relations. Next, experiments were conducted on the two datasets by adding bidirectional convolution and feature mapping, and the evaluation metrics Hits@1, Hits@3, Hits@10, and MRR were observed; the results show that the model tends to stabilize after about 200 epochs of training. Thereafter, a control experiment was conducted and the MR metric of the three models on the two datasets was observed. The experimental results show that after about 200 epochs the models have essentially converged and reach their optimal values on all five metrics.
This experiment used the encoding produced by the last epoch of the encoding layer as the encoded representation of entities and relations. After the feature mapping layer, the entities were mapped into the relation space. In the decoding layer, each evaluation metric was recorded every ten epochs. The results in Figure 2 show that after 200 epochs of training, the three models have essentially converged. Figure 2, Figure 3 and Figure 4 show the detailed results of the three models on the two datasets for all metrics except MR, and Figure 5 shows the results for the MR metric. The results show that all five metrics reach their optimal values at close to 200 epochs.
Our model was also able to perform well when facing sparse and non-overlapping relations. The multi-hop graph attention network, combined with the model's mapping mechanism, provides significant advantages when dealing with such relations: graph neural networks are particularly good at aggregating sparse relational links in the knowledge graph through a sparse representation of the adjacency matrix, and through the mapping layer the model can effectively identify and process non-overlapping relations and map them accurately. In future work, we will conduct detailed experiments on sparse and non-overlapping relations to further validate these conclusions and continuously improve the performance of the model in these settings.
The experiments used the graph attention encoding layer and the one-way convolution as the base model (ConGAT). With the models essentially converged, all three controlled experiments performed well, as shown in Table 3: bidirectional convolution, feature mapping, and R-Drop each led to a different degree of improvement. The final experiment shows that the model ConGAT–BI-FM–rDrop achieved the best results.
6. Conclusions
This paper presents a link prediction model based on feature mapping and bidirectional convolution. The model first concatenates multi-hop and single-hop information and encodes it with a graph attention network to obtain encoded representations of entities and relations. Then, the model maps the entities into the relation encoding space. Finally, the model concatenates the forward-order and reverse-order triples and decodes them with a convolutional neural network trained with R-Drop. The model achieved good results on the WN18RR and FB15k-237 datasets.
However, the model has some limitations. First, the model uses bidirectional convolution and an attention mechanism during encoding and decoding, which may be limited by the locality of information transfer. Second, the decoder is a convolutional network rather than a graph neural network matching the multi-hop graph attention encoder, which may limit the expressive power of the model. Therefore, in further research, the use of graph neural networks in both the encoding and decoding process could be explored to improve the performance of the model.
Future research directions could consider using graph neural networks for decoding based on feature mapping and multi-hop graph attention network encoding to achieve consistency in encoding and decoding models. This combined use of the same network structure may improve the representation and generalization capabilities of the model and further enhance the performance of the link prediction task. In addition, other improvements could be explored, such as introducing more sophisticated attention mechanisms or using more advanced pre-trained models to enhance model performance. These research directions are expected to promote the development and application of link prediction models in practical applications.