3.1. Iterative Keyword Semantic Aggregator
The proposed iterative keyword semantic aggregator (IKSA) uses the entity-pair information and the global sentence information to capture a specific relation vector for each sentence, which is then used to remove noisy words and to aggregate semantics. Because pairwise and global dependencies within the sentence ought to be considered jointly, we first apply context-to-relation attention to explore the global dependency of each word on the entire sentence in the context of the relation vector and the entity-pair information, which serves as an initial denoising step. We then aggregate the semantics of the keywords under this guidance to form a distinctive sentence representation. Furthermore, we apply standard token-to-token self-attention to produce a context-aware representation for each token in light of its syntactic dependencies on the other tokens in the same sequence; this step is computationally expensive but necessary for modeling word-level interactions. Finally, we repeat the last two steps to iteratively derive refined semantic relation features. The detailed structure of the IKSA is shown in Figure 2.
Input Representation. Tokens in sentences should be embedded into distributed representations for mathematical operations in neural networks [40]. For the input tokens $\{w_1, w_2, \ldots, w_m\}$ in a sentence, where $e_h$ and $e_t$ represent the head entity and tail entity, respectively, we train the mapping from each token $w_i$ to a vector $\mathbf{w}_i \in \mathbb{R}^{d_w}$ in an a priori manner with the use of GloVe [40]. The parameter $d_w$ indicates the dimension of the word embedding. In addition, to encode the sentence in an entity-aware manner, relative position embedding [17] is leveraged to represent the position information in the sentence. For example, the relative distances from the token “founder” to the head entity [Jeffery P. Bezos] and the tail entity [Amazon] are −2 and 2, respectively, in sentence S1 from Table 1. Finally, the representation of an input token is the concatenation of its word embedding and position embedding. We denote all the input tokens in a sentence as an input matrix $X \in \mathbb{R}^{m \times d}$, where $d$ is the dimension of the concatenated token representation and $m$ is the number of tokens in a sentence.
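As a concrete illustration of this input layer, the following PyTorch sketch concatenates pretrained GloVe vectors with two relative-position embeddings; the module name, the position-embedding dimension, and the maximum relative distance are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class TokenRepresentation(nn.Module):
    """Concatenate pretrained word embeddings with two relative-position embeddings."""

    def __init__(self, glove_weights: torch.Tensor, max_rel_dist: int = 100, pos_dim: int = 5):
        super().__init__()
        # glove_weights has shape (vocab_size, d_w); kept trainable here.
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        # Relative distances are shifted by max_rel_dist so they index a single table.
        self.pos_emb_head = nn.Embedding(2 * max_rel_dist + 1, pos_dim)
        self.pos_emb_tail = nn.Embedding(2 * max_rel_dist + 1, pos_dim)
        self.max_rel_dist = max_rel_dist

    def forward(self, token_ids, dist_to_head, dist_to_tail):
        # token_ids, dist_to_head, dist_to_tail: (batch, m) LongTensors
        w = self.word_emb(token_ids)                                             # (batch, m, d_w)
        ph = self.pos_emb_head(dist_to_head.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist)
        pt = self.pos_emb_tail(dist_to_tail.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist)
        return torch.cat([w, ph, pt], dim=-1)                                    # input matrix X: (batch, m, d)
```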
Capture Sentence-specific Relation Vector. Inspired by TransE [41], which treats the embedding of the relationship between two entities as a translation between the embeddings of the two entities, i.e., $\mathbf{r}_0 \approx \mathbf{e}_t - \mathbf{e}_h$, we argue that $\mathbf{r}_0$ can only approximate part of the relation between the two entities. Moreover, the same entity pair may correspond to different relationships in different contexts, while the embeddings of the entities are fixed. Therefore, the IKSA module considers the potential relationships of the entities within the current context to obtain a latent relation vector $\mathbf{r}$ between the entity pair.
Specifically, we first perform a compression operation by leveraging global average pooling, which retains the overall context information $S$:
$$ S_j = \frac{1}{m} \sum_{i=1}^{m} X_{ij}, $$
where $X_{ij}$ indicates the $j$-th dimension of the $i$-th word's features in the sentence input. Subsequently, the rough relation $\mathbf{r}_0$ and the global contextual information $S$ are concatenated with the embeddings of the entity pair and fed into the entity-relation self-attention layer, which applies a learnable weight matrix followed by the tanh non-linear function to project this input into the $d_r$-dimensional relation space, where $d_r$ is the dimension of the relation vector; we denote the resulting entity-relation representation by $E$. Next, this section designs an entity-relation interaction layer, which uses a learnable relation query matrix $R$ with $t$ relations to interact with $E$, obtaining a weight matrix for the two entities regarding each relation. In other words, each row of the matrix $R$ is a query vector, which is a representation of a specific type of relation. Since the potential relationship ought to consider both entities, the weights of the two entities are combined so that a weight $\alpha_i$ is assigned to each relation. The latent relation vector $\mathbf{r}$ is obtained as the weighted sum of all relations, serving as a compact representation of the relations for a specific sentence:
$$ \mathbf{r} = \sum_{i=1}^{t} \alpha_i R_i, $$
where $R_i$ is the $i$-th row of matrix $R$. The resultant latent relation vector $\mathbf{r}$ corresponds to the relation features of its sentence and is the latent representation of the relationship expressed by the sentence; it will also be utilized in the forthcoming denoising and semantic aggregation processes.
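The following PyTorch sketch is one plausible reading of this step. The per-entity weight matrix is collapsed into a single score per latent relation, and the layer names and dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceRelationVector(nn.Module):
    """Sketch: derive a sentence-specific latent relation vector r."""

    def __init__(self, d_token: int, d_ent: int, d_r: int, num_queries: int):
        super().__init__()
        # Projects [e_h; e_t; r0; S] into the d_r-dimensional relation space (tanh applied in forward).
        self.proj = nn.Linear(3 * d_ent + d_token, d_r)
        # Learnable relation query matrix R: one row per latent relation type.
        self.relation_queries = nn.Parameter(torch.randn(num_queries, d_r))

    def forward(self, X, e_head, e_tail):
        # X: (batch, m, d_token); e_head, e_tail: (batch, d_ent)
        S = X.mean(dim=1)                                   # global average pooling over tokens
        r0 = e_tail - e_head                                # rough, TransE-style relation
        E = torch.tanh(self.proj(torch.cat([e_head, e_tail, r0, S], dim=-1)))  # (batch, d_r)
        scores = E @ self.relation_queries.t()              # (batch, t): score of each latent relation
        alpha = F.softmax(scores, dim=-1)                   # weight assigned to each relation
        r = alpha @ self.relation_queries                   # weighted sum of relation queries -> (batch, d_r)
        return r
```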
Context-to-relation Attention. To exploit the global dependency of each word in expressing a relation, we first utilize the acquired relation vector $\mathbf{r}$ and the information of the entity pair to enhance the original input $X$, transforming it into the enhanced input $\tilde{X}$. Following this is the operation to obtain the dependency of each token from the enhanced input:
$$ V = \sigma\!\left(W_2\,\mathrm{GELU}\!\left(W_1 \tilde{X} + b_1\right) + b_2\right), $$
where $W_1$ and $W_2$ represent two weight matrices, and $b_1$ and $b_2$ represent their bias terms, respectively, for calculating $V$. Accordingly, we leverage the sigmoid activation function $\sigma(\cdot)$ to set the output in the range between 0 and 1, and $\mathrm{GELU}(\cdot)$ refers to the Gaussian error linear unit (GELU) function, as the input of neurons tends to follow a normal distribution. Because $V$ matches the dimensions of the enhanced input, it provides a score for each feature of each word, so it can select the features that best describe the word's relational meaning in the enhanced sentence:
$$ H = V \odot \tilde{X}, $$
where $\odot$ represents element-wise multiplication. Hence, such information is preserved in the output $H$ for further relation feature extraction.
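A minimal sketch of the gating operation follows, assuming the enhancement of $X$ with $\mathbf{r}$ and the entity-pair information has already been applied to the input; how that enhancement is realized (e.g., by concatenation or addition) is left open here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextToRelationGate(nn.Module):
    """Sketch: feature-wise gate that keeps relation-relevant features of each token."""

    def __init__(self, d: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d)

    def forward(self, x_enhanced):
        # x_enhanced: (batch, m, d) -- the input X already enhanced with r and the entity pair.
        v = torch.sigmoid(self.fc2(F.gelu(self.fc1(x_enhanced))))   # per-feature scores in (0, 1)
        return v * x_enhanced                                       # element-wise gating (⊙)
```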
Relation Semantic Aggregation and Word Interaction. To obtain accurate relation features, we perform keyword semantic extraction and aggregation in this section. Additionally, we consider the specific meanings of the same word in different contexts through word-level interactions. In other words, after the semantic extraction of each keyword, we update the word vectors using sequence self-attention to form a more comprehensive sentence representation. Preliminarily, we project the relation vector $\mathbf{r}$ together with the head-entity and tail-entity information into the vector space of word embeddings to serve as the two target vectors for information aggregation. As a result, we select the words that are most relevant to the target vectors. The relevancy matrix between the tokens in $H$ and the two target vectors is computed first. To select the top $k$ tokens, we compute a weight for each token from the relevancy matrix; meanwhile, the aggregated relation features are obtained as the weighted sum of the selected token representations. It is necessary to note that the weights of the entity pair are naturally higher when calculating relevance, because an accurate relational feature must include the information of the entity pair along with the words describing the relationship. To avoid misleading the extractor, the aggregated features do not incorporate information from the two target vectors in a weighted manner, as the relation vector they contain is not inherently present in the sentence itself. Then, the aggregated information is activated by two linear transformations with a ReLU activation in between, whose learnable parameters keep the output $Q$ in the same shape as its input.
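One possible reading of the top-$k$ keyword aggregation is sketched below. The dot-product relevance score, the softmax over the selected tokens, and the shape-preserving two-layer refinement are assumptions consistent with, but not necessarily identical to, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeywordAggregator(nn.Module):
    """Sketch: aggregate the top-k tokens most relevant to a target vector."""

    def __init__(self, d: int, k: int):
        super().__init__()
        self.k = k
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # two linear layers, ReLU in between

    def forward(self, H, target):
        # H: (batch, m, d) token representations; target: (batch, d) head- or tail-oriented target vector.
        rel = torch.einsum("bmd,bd->bm", H, target)           # relevance score of each token
        topk_val, topk_idx = rel.topk(self.k, dim=-1)         # keep only the k most relevant tokens
        weights = F.softmax(topk_val, dim=-1)                 # (batch, k)
        selected = H.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, H.size(-1)))  # (batch, k, d)
        aggregated = torch.einsum("bk,bkd->bd", weights, selected)                 # weighted sum of keywords
        return self.ffn(aggregated)                           # shape-preserving two-layer refinement
```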
To implement word interaction, multi-head self-attention (MHSA) is utilized to obtain the dependencies between every two tokens in $H$. For clarity, we first give the definition of the dot-product attention mechanism:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, $$
where $Q$, $K$, and $V$ represent the query, key, and value, respectively, and $d_k$ is the dimension of the key. Note that they are all derived from $H$ through three different transformation matrices in IKSA: $Q = HW^{Q}$, $K = HW^{K}$, and $V = HW^{V}$. Subsequently, these three matrices can each be replaced by $n$ matrices of the same shape to form $n$ heads:
$$ \mathrm{MHSA}(H) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_n\right)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(HW_i^{Q}, HW_i^{K}, HW_i^{V}\right). $$
Since the MHSA keeps the output shape identical to the input shape, when the word interaction is insufficient, we repeat the above operations in this section to obtain more comprehensive semantics for each word and more refined relation features. For example, the input of the $i$-th relation semantic aggregation is the token representation produced by the $(i-1)$-th iteration, together with the relation vector $\mathbf{r}$ and the two target vectors. Correspondingly, the skip connection is taken into account in the output of each operation:
$$ O^{(i)} = \mathcal{F}\!\left(I^{(i)}\right) + I^{(i)}, $$
where $\mathcal{F}(\cdot)$ denotes the corresponding operation (keyword semantic aggregation or MHSA), $I^{(i)}$ and $O^{(i)}$ are its input and output, and $i$ represents the $i$-th iterative process. The residual information can facilitate the training of the deeper layers of the extractor. In addition, layer normalization is also applied during different iterations to stabilize and accelerate the training process. Eventually, a distinctive sentence representation $\mathbf{s}$ is fitted by a neural layer that combines $\mathbf{u}_h$ and $\mathbf{u}_t$, the final keyword semantic aggregations for the head and tail entities, respectively.
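The iterative word-interaction step can be sketched with standard PyTorch components as follows. The post-norm ordering and the omission of the per-iteration keyword aggregation are simplifications, and $d$ must be divisible by the number of heads.

```python
import torch
import torch.nn as nn

class IterativeWordInteraction(nn.Module):
    """Sketch: repeated MHSA blocks with residual connections and layer normalization."""

    def __init__(self, d: int, num_heads: int, num_iters: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d, num_heads, batch_first=True) for _ in range(num_iters)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(num_iters))

    def forward(self, H):
        # H: (batch, m, d) token representations after context-to-relation attention.
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(H, H, H)        # token-to-token self-attention (Q = K = V derived from H)
            H = norm(H + out)             # skip connection followed by layer normalization
        return H
```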
3.2. Multi-Objective Multi-Instance Learning
In this section, we present our proposed multi-objective multi-instance learning (MOMIL) module, as shown in
Figure 3, to alleviate the influence of sentence-level noise. We focus on handling multiple relations within a bag of sentences, taking into account both the bag labels and the potential relations of false instances.
Given the sentence representations produced by the IKSA, we select the instance that best matches the predicted relation $r$ as the seed true instance. Intuitively, we argue that instances whose distance to the seed true instance is less than a certain threshold express the same relation. In other words, instances that express the same relation can be clustered in a relation space. Under this assumption, an appropriate threshold and a proper clustering algorithm are crucial; otherwise, instances expressing different relationships might be clustered together, or other true instances might be missed. Therefore, we choose a tight threshold along with a greedy algorithm, as shown in Algorithm 1, to avoid omitting true instances.
Given two sentence representations $\mathbf{s}_i$ and $\mathbf{s}_j$, we encode them into probability distributions $p_i$ and $p_j$. We adopt the JS distance between $p_i$ and $p_j$ as the distance between $\mathbf{s}_i$ and $\mathbf{s}_j$, which is computed using the Jensen–Shannon (JS) divergence:
$$ \mathrm{JS}\!\left(p_i \,\|\, p_j\right) = \frac{1}{2}\,\mathrm{KL}\!\left(p_i \,\Big\|\, \frac{p_i + p_j}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(p_j \,\Big\|\, \frac{p_i + p_j}{2}\right). $$
The JS divergence is the symmetrized and normalized version of the Kullback–Leibler (KL) divergence:
$$ \mathrm{KL}\!\left(p \,\|\, q\right) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}, $$
and its properties make calculations and threshold adjustments more convenient. Specifically, the value of the JS divergence ranges from 0 to 1 (with the base-2 logarithm); the closer it is to 0, the more similar the two distributions are, whereas the value of the KL divergence has no upper limit.
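A small sketch of this distance is given below, assuming the probability distributions are obtained by a softmax over the representation dimensions and using base-2 logarithms so the value stays in [0, 1].

```python
import torch
import torch.nn.functional as F

def js_distance(s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
    """Sketch: base-2 Jensen-Shannon divergence between two sentence representations."""
    p = F.softmax(s_i, dim=-1)                 # encode the representation as a distribution
    q = F.softmax(s_j, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p / m).log2()).sum(dim=-1)   # KL(p || m) in bits
    kl_qm = (q * (q / m).log2()).sum(dim=-1)   # KL(q || m) in bits
    return 0.5 * kl_pm + 0.5 * kl_qm           # value in [0, 1]
```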
With the representation assignment algorithm, all instances in a bag are categorized into a true set and a false set. Referring to the concept from Zhou et al. [42], semantic enhancement is applied to the true instances to distill the similar relation features shared by these sentence representations, forming an ultimately accurate bag representation for training. The weight of each instance is determined by its correlation with the other instances, where $k$ is the size of the set $V$ and $\|\cdot\|$ denotes the Euclidean norm (or 2-norm) of a vector used in the correlation computation; the bag representation $\mathbf{b}$ is obtained as the weighted sum of these true instances after a softmax over the weights. In this way, features that are less relevant to the relation are further filtered out. Unlike previous methods that directly assign weights to sentences in a bag, we refine the features after identifying the set of true instances. Our approach prevents any single sentence that best aligns with the label relationship from dominating with an excessively high weight. Instead, we focus on extracting similar features from the set of positive instances, leading to a more even distribution of weights and, consequently, a more comprehensive and accurate representation of the bag.
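A sketch of this semantic enhancement follows, assuming cosine similarity as the correlation measure; the text above only fixes that the Euclidean norm is involved, so this choice is an assumption.

```python
import torch
import torch.nn.functional as F

def bag_representation(true_instances: torch.Tensor) -> torch.Tensor:
    """Sketch: weight each true instance by its correlation with the others, then pool.

    true_instances: (k, d) representations of the k instances in the true set V.
    """
    normed = true_instances / true_instances.norm(dim=-1, keepdim=True)      # 2-norm normalization
    sim = normed @ normed.t()                                                # (k, k) pairwise correlations
    scores = (sim.sum(dim=-1) - 1.0) / max(true_instances.size(0) - 1, 1)    # mean correlation with the others
    weights = F.softmax(scores, dim=0)                                       # softmax over instances
    return weights @ true_instances                                          # weighted sum -> bag representation b
```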
Algorithm 1 Representation assignment
- Require: sentence representations in a bag $B = \{\mathbf{s}_1, \ldots, \mathbf{s}_n\}$, threshold $\theta$
- Ensure: true instances set $V$
- 1: Add the most possible true instance $\mathbf{s}^{*}$ to set $V$
- 2: Initialize the queue $Q$ with $\mathbf{s}^{*}$
- 3: while $Q$ is not empty do
- 4:   Dequeue the first element $\mathbf{s}_q$ from $Q$
- 5:   Compute the distances between $\mathbf{s}_q$ and the instances in $B \setminus V$
- 6:   if a distance $d(\mathbf{s}_q, \mathbf{s}_j) < \theta$ then
- 7:     Add the corresponding $\mathbf{s}_j$ to $V$
- 8:     Enqueue $\mathbf{s}_j$ into $Q$
- 9:   end if
- 10: end while
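Algorithm 1 can be sketched as the following greedy breadth-first expansion, reusing the `js_distance` helper from the earlier sketch; the index-based bookkeeping is an implementation choice.

```python
from collections import deque
import torch

def representation_assignment(reps: torch.Tensor, seed_idx: int, threshold: float) -> set:
    """Sketch of Algorithm 1: greedily grow the true-instance set from a seed instance.

    reps: (n, d) sentence representations in a bag.
    seed_idx: index of the most probable true instance.
    threshold: distance threshold under which two instances are taken to share a relation.
    Returns the set of indices assigned to the true set V.
    """
    V = {seed_idx}
    Q = deque([seed_idx])
    while Q:
        q = Q.popleft()                                          # dequeue the first element
        for j in range(reps.size(0)):
            if j in V:
                continue
            if js_distance(reps[q], reps[j]).item() < threshold: # js_distance defined in the sketch above
                V.add(j)                                         # add the instance to the true set
                Q.append(j)                                      # and expand from it later
    return V
```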
Relation Prediction. To make use of the comprehensive bag feature, a fully connected layer with a tanh activation function is adopted, which performs a nonlinear transformation and maps the bag representation $\mathbf{b}$ to the relation prediction space $\mathbf{o}$:
$$ \mathbf{o} = \tanh\!\left(W_o \mathbf{b} + b_o\right), $$
where $W_o$ and $b_o$ represent the weight and the bias, respectively. Then, a softmax classifier is utilized to predict the entity relation $r$:
$$ p\!\left(r \mid B\right) = \frac{\exp\!\left(o_r\right)}{\sum_{r'} \exp\!\left(o_{r'}\right)}. $$
We define the objective of classification using a cross-entropy function, as follows:
$$ \mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{M} \log p\!\left(r_i \mid B_i\right), $$
where $M$ is the number of bags.
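A compact sketch of the prediction head and the classification objective; `nn.CrossEntropyLoss` is used here as a stand-in for the softmax plus cross-entropy described above, and the module name is an assumption.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Sketch: map a bag representation to relation logits and train with cross-entropy."""

    def __init__(self, d_bag: int, num_relations: int):
        super().__init__()
        self.fc = nn.Linear(d_bag, num_relations)
        self.loss_fn = nn.CrossEntropyLoss()     # combines log-softmax and negative log-likelihood

    def forward(self, bag_repr, labels=None):
        # bag_repr: (M, d_bag) bag representations; labels: (M,) gold relation indices.
        o = torch.tanh(self.fc(bag_repr))        # nonlinear projection into the relation space
        if labels is None:
            return o.softmax(dim=-1)             # predicted relation distribution at inference
        return self.loss_fn(o, labels)           # classification objective over M bags
```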
Cross-level Contrastive Learning. For false instances, we implement cross-level contrastive learning to exploit the useful information within them. Previous works have shown the effectiveness of utilizing cross-level information. Thus, for the $i$-th false instance $\mathbf{s}_i^{f}$, we choose the representations of other bags as its negative pairs, since each such bag representation has been denoised and naturally corresponds to a relation triple different from that of the false instances in the original bag. However, when constructing a positive pair, it is not clear which other instance expresses the same relationship as the false instance $\mathbf{s}_i^{f}$. Therefore, we generate a positive instance $\tilde{\mathbf{s}}_i$ with dynamic gradient perturbations to solve this issue, since a previous work [43] proved its effectiveness in creating a pseudo-positive sample with minimal deviation from the original sample, ensuring that the two are sufficiently similar. The perturbation is derived from the gradient of the loss function with respect to the instance representation and is scaled by a hyperparameter $\epsilon$ that regulates the degree of disturbance. Figure 3 shows an example of how to construct a positive pair and a negative pair in a bag. We define an objective using the InfoNCE [23] loss for the representation $\mathbf{s}_i^{f}$:
$$ \mathcal{L}_{\mathrm{CL}} = -\log \frac{\exp\!\left(\mathrm{sim}\!\left(\mathbf{s}_i^{f}, \tilde{\mathbf{s}}_i\right)\right)}{\exp\!\left(\mathrm{sim}\!\left(\mathbf{s}_i^{f}, \tilde{\mathbf{s}}_i\right)\right) + \sum_{j} \exp\!\left(\mathrm{sim}\!\left(\mathbf{s}_i^{f}, \mathbf{b}_j^{-}\right)\right)}, $$
where $\mathrm{sim}(\cdot,\cdot)$ indicates a cosine function measuring the similarity between two sentence representations and $\mathbf{b}_j^{-}$ denotes the representation of another bag. Through such a design, we can bring the sentence representations of the same relational triples closer together, while pushing the representations of different relational triples further apart.
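The sketch below illustrates one way to build the positive pair and the contrastive objective. The gradient normalization and the absence of a temperature term are assumptions, the helper names are hypothetical, and `false_inst` must participate in the loss graph with `requires_grad` enabled.

```python
import torch
import torch.nn.functional as F

def gradient_perturbed_positive(false_inst: torch.Tensor, loss: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Sketch: build a pseudo-positive sample by perturbing the instance along the loss gradient."""
    # The paper only states that the perturbation comes from the loss gradient and is
    # scaled by a hyperparameter epsilon; the L2 normalization here is an assumption.
    grad, = torch.autograd.grad(loss, false_inst, retain_graph=True)
    return false_inst + epsilon * grad / (grad.norm() + 1e-12)

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, negatives: torch.Tensor) -> torch.Tensor:
    """Sketch: InfoNCE loss with cosine similarity (no temperature term assumed).

    anchor, positive: (d,) representations; negatives: (num_neg, d) representations of other bags.
    """
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positive.unsqueeze(0)).squeeze(0)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)   # (num_neg,)
    logits = torch.cat([pos_sim.view(1), neg_sim], dim=0)
    return -torch.log_softmax(logits, dim=0)[0]    # -log( exp(pos) / sum(exp(all)) )
```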