1. Introduction
Relation classification is an important part of information extraction and is a form of supervised relation extraction [1]. Its target task is to predict the relation between two entities in a text with well-labeled entities. Relation classification is an important step in constructing structured data and is also an important basis for many tasks, such as text classification [2], sentiment analysis [3], and question answering.
Deep-neural-network-based methods have been widely used in relation classification. However, existing deep learning methods usually use the feature vector of the whole sentence as the semantic information for relation classification. Such a vector usually contains much information that is useless for relation classification, so the model cannot accurately focus on the semantics of the relationship between the entities. In addition, relation classification models usually use a Linear() function to map features to the final relational category probability distribution; when a high-dimensional vector is mapped to a vector with only a dozen category dimensions, semantic information that is helpful for relation classification is likely to be lost.
In the early days, methods such as word2vec [4], GloVe [5], and ELMo [6] were used to generate word vectors, which were then used to extract deep semantics through neural networks. Ref. [7] used convolutional neural networks to extract lexical-level features and sentence-level features and map them into high-level features for relation classification.
Ref. [8] applied recurrent neural networks to relation classification. Recurrent neural networks accumulate the semantics of a sentence word by word, whereas convolutional neural networks must learn two local patterns and merge them; the recurrent formulation is therefore more natural, and the semantic distribution formed by recurrent neural networks is smoother than that of convolutional neural networks. Although recurrent neural networks perform well in extracting local features, they are somewhat weak in handling long-range dependencies in sequential data. Ref. [9] used a Bidirectional Long Short-Term Memory network, which aims to address the fact that bidirectional recurrent neural networks do not capture long-term dependencies well. Recently, pre-trained language models have had a significant impact on the field of natural language processing. Ref. [10] used a pre-trained BERT model as a feature extractor and obtained quite good results by concatenating the extracted features and feeding them into the classifier. Ref. [11] outlined an innovative approach to textual information extraction using domain ontologies and language rules, which was experimentally shown to be feasible.
Most of the above methods concatenate lexical-level features and sentence-level features to form the final feature vector for classification. However, sentence-level features cannot accurately represent the relation between two entities and inevitably carry interfering information, which affects the final relation classification results. Moreover, existing methods simply splice entity features and sentence features together and feed them into a Linear() function to obtain prediction results. Such a straightforward approach not only fails to make full use of the feature information but also loses some semantic information.
To solve the above problems, we start from the question "How can we accurately focus on the connection between two entities?". An existing approach is to remove meaningless edge information and keep the meaningful core information by sentence compression; however, this increases computation and complicates the model [12]. We find that, in most cases, the description of an entity relation in a sentence lies between the two entities, and the words outside this span are meaningless for relation classification. Therefore, we truncate the sentence span between the two entities and merge the semantics of this span with the semantics of the whole sentence, obtaining the "Strengthen Relational Semantics" feature vector. Using this feature vector greatly improves the model's understanding of what kind of relationship should exist between a pair of entities, enabling the model to focus on the semantics of the relationship between entities. To make full use of the semantic features extracted by the model, a completely new prediction structure was designed for relation classification. The various feature vectors extracted by the model are fed into this structure, and the final prediction results are obtained after Multi-Class Attention. In this way, we capture not only the overall semantic information of the whole sentence but also the relational information between entities. Various semantic features are fully utilized to enable the model to better handle the relation classification task. Therefore, to address the problems in existing methods, we propose a method that captures the relational span and uses attention for relation classification.
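To make this idea concrete, the following minimal sketch shows how the token span lying between two entities might be pooled and fused with the whole-sentence representation. The helper names, the span boundaries, and the additive fusion are illustrative assumptions, not the exact SRS formulation defined later in the paper.

```python
import torch

def relational_span_indices(e1_end: int, e2_start: int):
    """Indices of the tokens that lie strictly between the two entities
    (assumes the entity boundary positions are known from pre-processing)."""
    return list(range(e1_end + 1, e2_start))

def strengthen_relation_semantics(hidden_states: torch.Tensor,
                                  sentence_vec: torch.Tensor,
                                  e1_end: int, e2_start: int) -> torch.Tensor:
    """Fuse the pooled span semantics with the whole-sentence semantics.

    hidden_states: (seq_len, hidden) token representations from the encoder
    sentence_vec:  (hidden,) sentence-level vector (e.g., the [CLS] output)
    Mean pooling plus an additive fusion is only one plausible choice;
    the paper's SRS fusion may differ in detail.
    """
    span_idx = relational_span_indices(e1_end, e2_start)
    span_vec = hidden_states[span_idx].mean(dim=0)   # semantics of the relational span
    return torch.tanh(sentence_vec + span_vec)       # "strengthened" relational vector
```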
The innovations in this paper are as follows:
1. A feature fusion method called “SRS (Strengthen Relation Semantics)” is proposed. We fuse the global information of the whole sentence with the relational information between entities to form a vector that strengthens the relational semantics, addressing the problem that existing methods cannot effectively focus on the relational semantics between entities.
2. A new attention-based prediction structure is designed. To the best of our knowledge, we are the first to use full attention instead of fully connected layers to predict the probability distribution over categories in multi-class classification. With this prediction structure, we can make full use of the various feature information and reduce the loss of semantic information (see the sketch after this list).
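The sketch below illustrates the second point: each relation class is scored by letting a learned per-class query attend over the extracted feature vectors, rather than collapsing them through a single fully connected layer. The dimensions, the scaled dot-product form, and the final scoring rule are assumptions for illustration only, not the exact Multi-Class Attention defined in the model section.

```python
import math
import torch
import torch.nn as nn

class MultiClassAttentionHead(nn.Module):
    """Attention-based prediction head (sketch): one learned query per class
    attends over the feature vectors, replacing a single Linear() mapping."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, hidden_size))  # q
        self.key_proj = nn.Linear(hidden_size, hidden_size)                       # builds k
        self.value_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_features, hidden), e.g., sentence, entity, and SRS vectors
        k = self.key_proj(features)
        v = self.value_proj(features)
        scores = torch.einsum("ch,bfh->bcf", self.class_queries, k) / math.sqrt(k.size(-1))
        weights = torch.softmax(scores, dim=-1)          # attention over the feature vectors
        context = weights @ v                            # (batch, num_classes, hidden)
        logits = (context * self.class_queries).sum(-1)  # one score per relation class
        return logits                                    # fed to softmax / cross-entropy
```

In this sketch, each class query forms its own weighted mixture of the features, so the information does not have to be squeezed through one low-dimensional projection before scoring.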
2. Related Work
Traditionally, the two main approaches to relation extraction are supervised and semi-supervised. Supervised relation extraction, also known as relation classification, is the most effective and well-researched approach and uses fully labeled manual data for training. The supervised relation extraction task does not have an entity recognition subtask, so the main structure of the model is a feature extractor plus a relational classifier. Remote supervision is a form of semi-supervision and is currently a solution to the lack of supervised data. However, there are several problems with remotely supervised data: firstly, remotely supervised labeled data contain a large number of errors; secondly, it cannot handle the situation where a pair of entities holds multiple relations; thirdly, the false-negative problem, where instances labeled as negative samples actually have relations, but such knowledge does not exist in the knowledge graph, leading to labeling errors. Most of the current research on remote supervision focuses on the first problem. The research in this paper focuses on supervised relation classification, and the main feature extraction networks for relation classification are the convolutional neural network, the recurrent neural network, and attention-based models such as the Transformer [13,14] and BERT [15].
Convolutional neural networks: Ref. [16] applied convolutional neural networks to relation classification and proposed a convolutional DNN algorithm for extracting lexical-level features and sentence-level features. Positional features were also proposed to encode the relative distance between the current word and the target word pair. The model architecture proposed by [17] is basically the same as the previous work; the biggest change is the replacement of the loss function. The innovation lies in the ranking loss, which, compared with the Softmax function, encourages the model not only to score the positive category as high as possible but also to score easily misclassified categories as low as possible. The disadvantage is still the defect of the model structure. Ref. [18] proposed a new structure, the Augmented Dependency Path, which combines the shortest dependency path between two entities with the subtrees attached to it. The subtrees are modeled with a recurrent neural network, and the generated subtree representations are appended to the words on the shortest dependency path so that these words receive new word embeddings; a convolutional neural network is then used to capture the key features on the shortest dependency path. The above methods all use convolutional neural networks as the feature extraction framework and all use the Linear() function to map the probability distribution of relational categories. Although convolutional neural networks are widely used in computer vision because of their excellent ability to extract local information, globally dependent information is very important in NLP, especially in relation classification. Convolutional neural networks have gradually been phased out because of their shortcomings in extracting global information.
Table 1 shows a summary of our model and the convolutional-neural-network-based models.
Recurrent neural networks: Ref. [19] proposed an RNN-based framework to model long-range relational patterns, and experiments demonstrated the capability of the RNN-based approach in long-range pattern modeling. Ref. [20] proposed a deep-learning relation classification model based on the shortest dependency path; they not only used a bidirectional RCNN but also considered the directionality of relationships between entities. Ref. [9] proposed the use of a Bidirectional Long Short-Term Memory network (BLSTM) to model sentences containing complete, sequential information about all words, achieving state-of-the-art performance at the time. Ref. [21] added an attention mechanism to the Bi-LSTM and proposed the AttBLSTM model to capture the most important semantic information in sentences. Ref. [22] designed an attention-based BLSTM layer for converting semantic information into high-level features and also proposed a new filtering mechanism to reduce noise. All of the above methods use recurrent neural networks, or their long short-term memory variants, as the feature extraction framework, and all use the Linear() function to map the probability distribution of relational categories. Although recurrent neural networks are better able to process sequential data and extract global information than convolutional neural networks, they are still unsuitable for long sequences and prone to vanishing gradients.
Table 2 shows a summary of our model and the recurrent-neural-network-based models.
Attention mechanism: Ref. [23] added the attention mechanism to CNNs. Two levels of attention are employed: the first is applied to attention between individual word pairs in the input sequence, and the second is applied to attention on the blending layer for the target category. Ref. [24] proposed a BERT-based model that performs relation extraction without combining lexical and syntactic features, achieving SOTA performance and providing a baseline for follow-up work. Ref. [25] proposed a Transformer-based relation extraction method, TRE, which replaces the explicit linguistic features required by previous methods with implicit features captured in a pre-trained language representation. Ref. [26] added an additional MTB (Matching The Blanks) task to the pre-training process of BERT to improve relation extraction performance during the pre-training phase. Ref. [27] introduced a dependency-based attention mechanism into the BERT architecture to learn high-level syntactic features. The dependency relation between each word and the target entity is considered, while different levels of semantic information are obtained by using the BERT middle layers to fuse multi-grained features for the final relation classification. Ref. [28] used a fine-tuned BERT model to extract the semantic representation of sequences and then used segmental convolution to obtain the semantic information affecting the relation classification. The closest to our work is R-BERT [10], which also uses the pre-trained BERT model as a feature extractor to extract high-quality semantic features and adds the special symbols $ and # during data pre-processing to highlight the entity vectors and facilitate the classification of entity relations. The above methods all use the attention-based Transformer or BERT as the feature extraction framework, and all use the Linear() function to map the probability distribution of relational categories. These methods extract semantically rich feature vectors using pre-trained models and continue to improve the performance of relation classification models. However, they do not have a systematic prediction structure and simply use the Linear() function to obtain the probability of each category in a brute-force manner, which results in the loss of feature-vector semantics.
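For concreteness, a minimal sketch of this R-BERT-style marking step is shown below; the example sentence, token list, and span indices are invented for illustration, and the real pre-processing pipeline may differ in detail.

```python
def mark_entities(tokens, e1_span, e2_span):
    """Surround the first entity with '$' and the second with '#',
    in the spirit of R-BERT's pre-processing (illustrative sketch)."""
    out = []
    for i, tok in enumerate(tokens):
        if i == e1_span[0]:
            out.append("$")
        if i == e2_span[0]:
            out.append("#")
        out.append(tok)
        if i == e1_span[1]:
            out.append("$")
        if i == e2_span[1]:
            out.append("#")
    return out

# Hypothetical example with inclusive (start, end) entity spans:
print(mark_entities(
    ["The", "audience", "applauded", "the", "movie", "."],
    e1_span=(1, 1), e2_span=(4, 4)))
# ['The', '$', 'audience', '$', 'applauded', 'the', '#', 'movie', '#', '.']
```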
Table 3 shows a summary of our model and the attention-based mechanism models.
Based on the problems in the above methods, the model proposed in this paper uses SpanBERT as the feature extraction framework to solve the problems of convolutional neural networks and recurrent neural networks in extracting features from sequential data. We use the proposed "SRS" to capture the relationship information between entities and reduce the interference of other irrelevant words. A prediction structure is also designed for the model to make full use of the various feature information and reduce the semantic loss of the feature vectors. There are also works that use remotely supervised methods for relation classification, e.g., Ref. [29], which used a contrastive learning approach to aggregate features and reduce noise in the data. However, this paper mainly focuses on the supervised relation classification task, so remote supervision is not described in detail; we aim to cover relation classification with remote supervision in future work.
4. Experiment
In this section, we verify that our proposed method can effectively capture relational information and fully exploit various semantic features. We conducted experiments on two publicly available datasets; the specific experimental results and analysis are given in Section 4.2, and the experimental settings are described in Section 4.1. To better show the specific role of the components in our approach, we performed ablation experiments on the SemEval-2010 Task 8 dataset (see Section 4.3 for details).
4.1. Setup
Datasets: We used two publicly available datasets to validate the effectiveness of our method: the SemEval-2010 Task 8 dataset and the KBP37 dataset. Table 4 shows the statistics of each dataset. The SemEval-2010 Task 8 dataset was provided by Hendrickx et al. as a free dataset containing 10,717 samples in total: 8000 samples for training and 2717 samples for testing. The dataset contains nine relation types, where the relations are ordered. The directionality of the relations effectively doubles the number of relations, since an entity pair is considered correctly labeled only if the order is also correct. So, finally, there are 19 relations (2 × 9 + 1 other class). The KBP37 dataset includes 18 semantic relations and a "no relation" class. Similar to SemEval-2010 Task 8, the relations are directional, so the actual number of relation types is 37 (2 × 18 + 1). It contains 15,917 training instances and 3405 test instances.
Evaluation: We adopt the standard relation classification evaluation scheme: precision, recall, and micro F1 are used as evaluation metrics. For both the SemEval-2010 Task 8 dataset and the KBP37 dataset, a prediction is counted as a positive sample only if the subject–object order of the entities in the predicted relation is also correct; otherwise, it is a negative sample (e.g., for a sentence labeled Message–Topic(e1,e2), the prediction Message–Topic(e2,e1) is an incorrect prediction).
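Since the direction is part of the label itself, a micro-averaged score over the directed relation labels reproduces this scheme. The snippet below is a minimal sketch using scikit-learn, with the label strings chosen only for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Directed labels: swapping (e1,e2) and (e2,e1) yields a different class,
# so a prediction with the wrong subject-object order counts as an error.
gold = ["Message-Topic(e1,e2)", "Cause-Effect(e2,e1)", "Other"]
pred = ["Message-Topic(e2,e1)", "Cause-Effect(e2,e1)", "Other"]  # first one is wrong

p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="micro")
print(f"precision={p:.2f}  recall={r:.2f}  micro-F1={f1:.2f}")
```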
Platform Setup: The IDE used for the experiments in this paper is PyCharm 2021 Professional Edition, with PyTorch version 1.9.1, CUDA version 11.6, and CUDNN version 10.2. Model training and inference are performed on an NVIDIA A100-SXM GPU with 40 GB of GPU memory and 16 GB of CPU memory.
Implementation details: For a fair comparison with previous work, we used SpanBERT-base uncased as our base encoder for extracting features from the corpus in each dataset. In addition, we also designed a model with SpanBERT-large uncased as the encoder. We set the training batch size to 32, the test batch size to 16, the number of epochs to 50, and the dropout to 0.1; the learning rate and the detailed hyperparameter settings are shown in Table 5 and Table 6.
4.2. Experimental Results
This section presents the experimental results of our method on the two publicly available datasets. On the SemEval-2010 Task 8 dataset, we used precision, recall, and micro F1 as evaluation metrics and performed a full comparison with previous methods. Additionally, to demonstrate the specific performance of our method on each relation class, we compared it in more detail with two of the stronger current state-of-the-art models. To verify that our model generalizes well, we conducted a comparison on the KBP37 dataset using micro F1 as the evaluation metric.
Table 7 shows the experimental results of our model and previous models on the SemEval-2010 Task 8 dataset. Our models are divided into two types: the base model, with SpanBERT-base as the feature extraction architecture, and the advanced model, with SpanBERT-large as the feature extraction architecture. Previous models include GLFN; TRE; BERTEM+MTB; R-BERT; Att-RCNN; LGCNN; Bi-SDP-Att; MALNet; BERT with entity, convolution, and max-pooling; and D-BERT. The experimental results show that our method outperforms all other methods, where the values in bold represent the best result for each metric. It can be observed that our base model achieves state-of-the-art results in both the recall and F1 metrics. The advanced model with SpanBERT-large as the feature extraction framework substantially outperforms the existing models in all three metrics. Taking the BERT with entity, convolution, and max-pooling model as an example, precision improves by 0.63 percentage points, recall improves by 0.44 percentage points, and F1 improves by 0.6 percentage points.
Table 8 compares the precision, recall, and F1 scores of our base model with R-BERT and BERT-ECM (BERT with entity, convolution, and max-pooling) on the SemEval-2010 Task 8 dataset. We find that R-BERT performs better only on the Message–Topic relation. In comparison with BERT-ECM, we find that BERT-ECM is slightly better than our model in precision; however, in recall and F1, our model outperforms BERT-ECM, especially on the recall metric. We attribute this to our proposed attention-based prediction structure: the way we construct q and k makes full use of the various features, which in turn helps the model better identify the correct relation class. We also find that the F1 score exceeds 90% for five relations, namely Cause–Effect, Content–Container, Entity–Destination, Member–Collection, and Message–Topic. Our guess is that these five relations may be more in line with our SRS feature extraction paradigm, in which the relationship between the entities is hidden between the two entities.
The above experiments have demonstrated that our model captures the semantics of relationships between entities well and makes full use of various features. To further validate the generalizability of our model, we conducted experiments on the KBP37 dataset. The experimental results are shown in Table 9. We selected some of the models from the above experiments for comparison, including the GLFN, R-BERT, Att-RCNN, LGCNN, MALNet, Bi-SDP-Att, and D-BERT models. The experimental results show that our model generalizes well: both the base and advanced models outperform the existing methods. The base model with SpanBERT-base as the feature extraction architecture achieves an F1 value of 69.33% on the KBP37 dataset, and the advanced model with SpanBERT-large as the feature extraction architecture achieves an F1 value of 69.55%.
4.3. Ablation Studies
We have demonstrated the effectiveness of our proposed approach. To further understand the specific contribution of each proposed component, we designed ablation experiments: one protocol without SRS and one without the prediction structure.
4.3.1. Role of SRS
In this section, we analyze and test the specific role of SRS: we discard the span of the relation between the entities so that the model no longer focuses on the relation between them. Thus, in Figure 1, the green vector s is no longer output and the prediction structure does not receive s as input. We call this model CRSAtt_NO_SRS. We conducted the corresponding experiments on the SemEval-2010 Task 8 dataset for CRSAtt_NO_SRS and the original model; the experimental results are shown in Table 10. With only the prediction structure and no SRS, the F1 of CRSAtt_NO_SRS decreases by 1.05. The experiments demonstrate that our proposed SRS feature fusion can effectively capture the relational features between entities.
4.3.2. Role of Prediction Structure
In this section, we analyze and test the specific role of the prediction structure. We remove the prediction structure we designed and, as in R-BERT, predict the classification by feeding the features through a single fully connected layer into the softmax classifier. That is, we directly concatenate the four vectors in Figure 1 and, after one fully connected layer, feed them into the softmax classifier to predict the results. We call this model CRSAtt_NO_PR. We performed the corresponding experiments on the SemEval-2010 Task 8 dataset for CRSAtt_NO_PR and the original model; the experimental results are shown in Table 11. The F1 of CRSAtt_NO_PR decreases by 0.9 compared with the full model. The experiments demonstrate that our proposed prediction structure can effectively fuse global semantic information and local relational information.
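For reference, the ablated head essentially reduces to an R-BERT-style classifier: concatenate the feature vectors and apply one fully connected layer followed by softmax. The minimal sketch below assumes the four vectors are the sentence vector, the two entity vectors, and the SRS vector; the hidden size and dropout placement are likewise illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatSoftmaxHead(nn.Module):
    """CRSAtt_NO_PR sketch: concatenate the four feature vectors and map them
    to class probabilities with a single fully connected layer + softmax."""

    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(4 * hidden_size, num_classes)  # one Linear() mapping

    def forward(self, sent_vec, e1_vec, e2_vec, srs_vec):
        x = torch.cat([sent_vec, e1_vec, e2_vec, srs_vec], dim=-1)
        return torch.softmax(self.fc(self.dropout(x)), dim=-1)  # class probabilities
```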
5. Conclusions
In recent years, BERT-based relation classification models have become increasingly popular with the rise of pre-trained models. However, these models cannot focus well on the semantics of the relationships between entities and rely only on the powerful feature extraction ability of the BERT model for relation classification. Therefore, in order to enable the model to focus on the semantics of relationships between entities and to make full use of the various feature information that facilitates relation classification, we propose a new relation classification model, CRSAtt, which addresses the problem whereby existing models cannot accurately extract the semantics of relationships between entities from sentence-level features alone. We intercept the span between the entities in a sentence during data pre-processing and use it as the semantics of the relationship between the entities. The sentence and the span are fed into the feature extraction architecture, and the extracted features are fed into the attention-based prediction structure for relation classification. To better grasp the semantic information of the relationships between entities, we propose a feature fusion method called SRS (Strengthen Relational Semantics), which aims to integrate global information and local relational information. In the prediction structure, the way q and k are constructed makes our model perform particularly well on the recall metric. Experiments on the SemEval-2010 Task 8 dataset showed that the CRSAtt model improved performance over existing methods, with an F1 score of 90.55%. In addition, the results of the ablation study on the SemEval-2010 Task 8 dataset show that our proposed SRS and attention-based prediction structure have a positive impact on the classification performance of the model.
In future work, the relation classification model needs to further improve its generalizability, so we will train the model using a remotely supervised approach based on the research in this paper, focusing on introducing external knowledge, with the aim of improving the generalizability and classification performance of the model using a large amount of data that does not require manual annotation.