MM-Transformer: A Transformer-Based Knowledge Graph Link Prediction Model That Fuses Multimodal Features
Abstract
1. Introduction
- (1) This paper proposes a method for fusing multimodal features that makes full use of structural, visual, and textual features: each modality is extracted by a dedicated encoder, and a Transformer then fuses the resulting features. This design effectively reduces the heterogeneity of multimodal entity representations (a minimal fusion sketch is given after this list).
- (2) By fusing feature information from the different modalities at every layer, the model represents entities and relations in multimodal knowledge graphs more comprehensively and better captures the complex interactions among multimodal features.
- (3) Case analysis shows that multimodal feature fusion effectively reduces the bias that any single modality may introduce; analyzing the contribution of each feature to the final result improves the interpretability and credibility of the model.
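To make contribution (1) concrete, below is a minimal sketch of Transformer-based multimodal fusion in PyTorch. The class name, feature dimensions, modality-type embeddings, and mean pooling are illustrative assumptions, not the authors' exact MM-Transformer architecture, which is detailed in Section 3.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Illustrative fusion module: one embedding per modality is treated as a
    token, and a standard Transformer encoder mixes them via self-attention.
    Dimensions and names are assumptions, not the paper's exact design."""

    def __init__(self, struct_dim=200, visual_dim=2048, text_dim=768,
                 hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.struct_proj = nn.Linear(struct_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Learnable embeddings marking which modality each token comes from.
        self.modality_embed = nn.Embedding(3, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, struct_feat, visual_feat, text_feat):
        # Each input: (batch, modality_dim) -> one "token" per modality.
        tokens = torch.stack([
            self.struct_proj(struct_feat),
            self.visual_proj(visual_feat),
            self.text_proj(text_feat),
        ], dim=1)                                    # (batch, 3, hidden_dim)
        ids = torch.arange(3, device=tokens.device)  # modality ids 0..2
        tokens = tokens + self.modality_embed(ids)   # add modality-type embeddings
        fused = self.encoder(tokens)                 # self-attention fuses the modalities
        return fused.mean(dim=1)                     # pooled multimodal entity representation


# Usage with random features standing in for modality-encoder outputs.
model = MultimodalFusion()
s = torch.randn(4, 200)      # e.g., structural embedding from a graph encoder
v = torch.randn(4, 2048)     # e.g., image feature from a visual encoder
t = torch.randn(4, 768)      # e.g., BERT [CLS] feature
print(model(s, v, t).shape)  # torch.Size([4, 768])
```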
2. Related Work
3. Methodology
3.1. Overall Architecture
3.2. Structural Feature Extraction
3.3. Visual Feature Extraction
3.4. Text Feature Extraction
3.5. Multimodal Feature Fusion
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Baselines
- VisualBERT [17], a pre-trained vision–language model with a single-stream structure.
- ViLBERT [16], a pre-trained vision–language model with a two-stream structure.
- IKRL [14], which extends TransE to learn image-based representations of entities and structure-based representations of the knowledge graph, respectively.
- TransAE [19], which combines a multimodal autoencoder with TransE to encode visual and textual knowledge into a unified representation, using the hidden layer of the autoencoder as the entity representation in the TransE model.
- RSME [21], which designs a forget gate with an MRP metric to select valuable images for multimodal knowledge graph embedding learning.
- MKGformer [28], a hybrid Transformer model with multi-level fusion that integrates visual and textual representations.
4.1.3. Experiment Details
5. Experimental Results
5.1. Overall Performance
5.2. Ablation Study
5.3. Visual Analysis
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, 11–15 February 2019; pp. 105–113. [Google Scholar]
- Yih, S.W.; Chang, M.W.; He, X.; Gao, J. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, Beijing, China, 26 July 2015; pp. 1321–1331. [Google Scholar]
- Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 4623–4629. [Google Scholar]
- Huang, J.; Zhao, W.X.; Dou, H.; Wen, J.R.; Chang, E.Y. Improving sequential recommendation with knowledge-enhanced memory networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 8–12 July 2018; pp. 505–514. [Google Scholar]
- Zhang, N.; Jia, Q.; Deng, S.; Chen, X.; Ye, H.; Chen, H.; Tou, H.; Huang, G.; Wang, Z.; Hua, N.; et al. Alicg: Fine-grained and evolvable conceptual graph construction for semantic search at alibaba. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3895–3905. [Google Scholar]
- Dietz, L.; Kotov, A.; Meij, E. Utilizing knowledge graphs for text-centric information retrieval. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Tokyo, Japan, 8–12 July 2018; pp. 1387–1390. [Google Scholar]
- Yang, Z. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; p. 2486. [Google Scholar]
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Adv. Neural Inf. Process. Syst. 2013, 26, 2787–2795. [Google Scholar]
- Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; pp. 1112–1119. [Google Scholar]
- Nathani, D.; Chauhan, J.; Sharma, C.; Kaul, M. Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv 2019, arXiv:1906.01195. [Google Scholar]
- Nguyen, D.Q.; Nguyen, T.D.; Nguyen, D.Q.; Phung, D. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv 2017, arXiv:1712.02121. [Google Scholar]
- Pezeshkpour, P.; Chen, L.; Singh, S. Embedding multimodal relational data for knowledge base completion. arXiv 2018, arXiv:1809.01341. [Google Scholar]
- Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 225–234. [Google Scholar]
- Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia, 19–25 August 2017; pp. 3140–3146. [Google Scholar]
- Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 13–23. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
- Wang, Z.; Li, L.; Li, Q.; Zeng, D. Multimodal data enhanced representation learning for knowledge graphs. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
- Zhao, Y.; Cai, X.; Wu, Y.; Zhang, H.; Zhang, Y.; Zhao, G.; Jiang, N. MoSE: Modality split and ensemble for multimodal knowledge graph completion. arXiv 2022, arXiv:2210.08821. [Google Scholar]
- Wang, M.; Wang, S.; Yang, H.; Zhang, Z.; Chen, X.; Qi, G. Is visual context really helpful for knowledge graph? A representation learning perspective. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2735–2743. [Google Scholar]
- Shankar, S.; Thompson, L.; Fiterau, M. Progressive fusion for multimodal integration. arXiv 2022, arXiv:2209.00302. [Google Scholar]
- Liang, P.P.; Ling, C.K.; Cheng, Y.; Obolenskiy, A.; Liu, Y.; Pandey, R.; Salakhutdinov, R. Quantifying Interactions in Semi-supervised Multimodal Learning: Guarantees and Applications. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Jiang, Y.; Gao, Y.; Zhu, Z.; Yan, C.; Gao, Y. HyperRep: Hypergraph-Based Self-Supervised Multimodal Representation Learning. Available online: https://openreview.net/forum?id=y3dqBDnPay (accessed on 22 September 2023).
- Golovanevsky, M.; Schiller, E.; Nair, A.A.; Singh, R.; Eickhoff, C. One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. In Proceedings of the ICML 2024 Workshop on Efficient and Accessible Foundation Models for Biological Discovery, Vienna, Austria, 27 July 2024. [Google Scholar]
- Zhang, X.; Yoon, J.; Bansal, M.; Yao, H. Multimodal representation learning by alternating unimodal adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 27456–27466. [Google Scholar]
- Li, X.; Zhao, X.; Xu, J.; Zhang, Y.; Xing, C. IMF: Interactive multimodal fusion model for link prediction. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2572–2580. [Google Scholar]
- Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; Chen, H. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 904–915. [Google Scholar]
- Gu, W.; Gao, F.; Lou, X.; Zhang, J. Link prediction via graph attention network. arXiv 2019, arXiv:1910.04807. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
- Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
| Dataset | Ent | Rel | Train | Dev | Test |
|---|---|---|---|---|---|
| FB15K-237-IMG | 14,541 | 237 | 272,115 | 17,535 | 20,466 |
| WN18-IMG | 40,943 | 18 | 141,442 | 5000 | 5000 |
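For reference, the following is a minimal sketch of how such triple files are often loaded and counted, assuming the common tab-separated `head \t relation \t tail` layout used by FB15K-237 and WN18 releases; the file names and directory layout are assumptions.

```python
from pathlib import Path

def load_triples(path):
    """Read tab-separated (head, relation, tail) triples, one per line."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))
    return triples

def dataset_stats(data_dir):
    """Count entities, relations, and split sizes, as in the dataset table above."""
    splits = {name: load_triples(Path(data_dir) / f"{name}.txt")
              for name in ("train", "dev", "test")}
    entities, relations = set(), set()
    for triples in splits.values():
        for h, r, t in triples:
            entities.update((h, t))
            relations.add(r)
    return {"Ent": len(entities), "Rel": len(relations),
            **{name.capitalize(): len(triples) for name, triples in splits.items()}}

# Example (path is an assumption):
# print(dataset_stats("data/FB15K-237-IMG"))
```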
| Model | FB15K-237-IMG Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ | WN18-IMG Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ |
|---|---|---|---|---|---|---|---|---|
| VisualBERT_base [17] | 0.217 | 0.324 | 0.439 | 592 | 0.179 | 0.437 | 0.654 | 122 |
| ViLBERT_base [16] | 0.233 | 0.335 | 0.457 | 483 | 0.223 | 0.552 | 0.761 | 131 |
| IKRL [14] | 0.194 | 0.284 | 0.458 | 298 | 0.127 | 0.796 | 0.928 | 596 |
| TransAE [19] | 0.199 | 0.317 | 0.463 | 431 | 0.323 | 0.835 | 0.934 | 352 |
| RSME [21] | 0.242 | 0.344 | 0.467 | 417 | 0.943 | 0.951 | 0.957 | 223 |
| MKGformer [28] | 0.256 | 0.367 | 0.504 | 221 | 0.944 | 0.961 | 0.972 | 28 |
| MM-Transformer | 0.259 | 0.362 | 0.511 | 215 | 0.948 | 0.968 | 0.976 | 117 |
| Modalities (FB15K-237-IMG) | Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ |
|---|---|---|---|---|
| T | 0.241 | 0.345 | 0.457 | 248 |
| S + T | 0.242 | 0.351 | 0.386 | 232 |
| V + T | 0.256 | 0.367 | 0.504 | 221 |
| S + V + T | 0.259 | 0.362 | 0.511 | 215 |
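The tables above report the standard link-prediction metrics: Hits@K is the fraction of test triples whose correct entity ranks within the top K candidates, and MR is the mean rank of the correct entity (lower is better). A minimal sketch of how these are computed from a list of 1-indexed ranks (the example ranks are hypothetical):

```python
def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Mean rank of the correct entity; lower is better."""
    return sum(ranks) / len(ranks)

# Hypothetical ranks for five test triples.
ranks = [1, 3, 7, 2, 120]
print(hits_at_k(ranks, 1))   # 0.2
print(hits_at_k(ranks, 10))  # 0.8
print(mean_rank(ranks))      # 26.6
```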