1. Introduction
Obtaining useful information from the vast amount of medical resources is one of the main problems facing modern healthcare. Information extraction is a fundamental step in text analysis [1] and includes named entity recognition, relation extraction, and event extraction [2]. Medical entity relation extraction classifies the relation categories between entity pairs in unstructured medical texts. These relations exist in the form of triples (<subject, predicate, object>), which are called entity relation triples. Relation extraction is the key and most difficult part of information extraction.
With the rapid development of biomedical text information extraction technology, more methods have become available for relation extraction tasks [3]. Early studies used dictionaries and medical domain knowledge bases to manually construct rule templates for relation extraction of medical entities [4]. Later, some scholars applied machine learning methods to medical text relation extraction, treating the task as a classification problem that recognizes the relation between entities [5]. Recently, deep learning methods have been the most widely applied in medical relation extraction, with recurrent neural networks (RNNs) [6], convolutional neural networks (CNNs) [7], and pre-trained language models being the mainstream neural architectures currently used for relation extraction.
Although many methods have achieved good results on relation extraction tasks, Chinese medical relation extraction still presents many difficulties. Chinese medical texts have flexible expressions and complex sentence structures, and they require different methods of text analysis. Chinese sentences contain no separators between words; rather, a series of consecutive Chinese characters forms a sentence, so correctly segmenting words according to the semantics is a crucial task. Errors in Chinese word segmentation can greatly affect the results of relation extraction. Current Chinese word segmentation methods include classical mechanical segmentation [8], statistical segmentation [9], and neural network methods [10]. For example, Yuxuan Lai et al. [11] proposed a novel Chinese pre-training paradigm, Lattice-BERT, which explicitly combines word representations with characters so that sentences can be modeled in a multi-granularity manner.
Currently, relation extraction methods can be divided into two categories: sequence-based and dependency-based. Sequence-based approaches use only word embeddings as input to the model, while dependency-based models incorporate dependency trees into the model. Graph convolutional networks (GCNs) can extract spatial features on topological graphs to learn information over the whole graph [12]. In existing studies, syntactic information is also widely used in relation extraction tasks: syntactic dependency relations provide better semantic guidance and entity relation information for sentences [13]. Compared with sequence-based approaches, dependency-based approaches can better obtain non-local entity relation information from sentences.
According to the above analysis, a Chinese medical relation extraction model needs to learn the sequence information of sentences; apart from that, syntactic dependency information should be considered. Therefore, this paper proposes a Chinese medical relation extraction model, BAGCN (BiLSTM + Attention + GCN), based on syntactic dependency structure information. The model captures the dependency structure information and sequence information of sentences through a graph convolutional network (GCN) and a bidirectional long short-term memory network (BiLSTM). In addition, considering the effect of noise on the dependency information, we incorporate a new pruning operation into the model. Finally, the model applies a multi-head attention mechanism to learn entity-related information from different perspectives. In this way, we make full use of the sequence information and dependency information of sentences to extract entity relation triples.
The main contributions of this paper are as follows:
(1) The model constructs a syntactic dependency tree for each sentence to learn sentence information. The dependency tree contains the syntactic information and relational structure between words in a sentence. By learning the dependency relations, the hidden features of entity relations in the sentence can be fully explored.
(2) The model combines BiLSTM and GCN to extract feature information together. BiLSTM can learn sentence sequence features at a shallow level, and GCN can fully learn node information in the dependency relation graph. By combining BiLSTM and GCN, the model can better learn the global feature information of the sentence.
(3) The model adopts a novel pruning strategy to remove the noise in the dependency tree. In this paper, the shortest dependency path between two entities in the dependency tree is constructed as the shortest path tree. The nodes connected to the head and tail entities form the local dependency tree. Then, the shortest path tree and the local dependency tree are combined to construct the final pruned tree. This pruning method both removes the redundant information in the sentence and retains the important information.
(4) The model introduces a multi-head attention mechanism to learn multi-perspective semantic information of sentences. The multi-head attention mechanism can automatically learn the importance and relevance of words in a sentence based on contextual information and multi-dimensional spatial information, further improving the performance of the relation extraction model.
3. Model
Medical sentences are often complex, with a large specialized vocabulary and long-range dependencies between entities, and they often contain important syntactic information. To better learn dependencies between entities and use sentence structure information, this paper proposes a relation extraction model based on syntactic dependency structure information. First, in the semantic encoding stage, the model embeds the input text and transforms each sentence into three vectors (a character vector, a lexical feature vector, and an entity feature vector), which are concatenated to better represent the sentence information. Second, the input vectors are encoded by the BiLSTM layer to obtain the sequence information of the sentences. Third, the model constructs a syntactic dependency tree for each sentence with the LTP tool. Then, we perform a pruning operation on the obtained dependency tree, and the pruned dependency tree is transformed into a graph structure. Fourth, the sequence information and graph information are jointly passed to the graph convolution layer for the convolution operation. In addition, a multi-head attention layer is added after the graph convolution layer to learn the weights of different entities. Finally, the output information is passed to the relation classification layer for relation classification. The model consists of an input layer, a BiLSTM layer, a GCN layer, a multi-head attention layer, and a classification layer. The model diagram is shown in Figure 1.
3.1. Input Layer
3.1.1. Corpus Pre-Processing
How to transform unstructured medical texts into graph structures is the basis for learning with graph convolutional neural networks, so the model requires pre-processing of the experimental corpus. First, the Chinese Language Technology Platform (LTP) [33] developed by the Harbin Institute of Technology was used to analyze and process the input sentences. The tool provides Chinese language processing modules such as word segmentation, part-of-speech tagging, syntax, and semantics. The model input in this paper consists of a character vector, a lexical feature vector, and a type feature vector. The character vectors are obtained by training on Chinese dictionaries. The lexical feature vectors are obtained by part-of-speech tagging of the input sentences with the LTP tool. Since there are different types of relations in the corpus, we add type features to improve classification. The type feature vector is set according to the entity type: if a character belongs to a certain entity after Chinese word segmentation, it is set to that entity's label; otherwise, it is set to UNK.
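The type-feature assignment described above can be sketched as follows; the entity spans and the "Disease" label here are hypothetical examples, not taken from the paper's corpus:

```python
def type_features(sentence, entities):
    """Assign a type label to every character: the entity's type label if the
    character falls inside an annotated entity span, otherwise "UNK".

    `entities` is a list of (start, end, type) spans, with `end` exclusive.
    """
    labels = ["UNK"] * len(sentence)
    for start, end, etype in entities:
        for i in range(start, end):
            labels[i] = etype
    return labels

# Hypothetical sentence with two disease entities at positions 0-2 and 5-9.
sent = "糖尿病引起视网膜病变"
ents = [(0, 3, "Disease"), (5, 10, "Disease")]
print(type_features(sent, ents))
```

Each character thus receives either an entity-type label or UNK, and these labels are later embedded as the type feature vector.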
Syntactic information carries important grammatical information in a sentence. Especially in the medical field, some entity pairs in a sentence are closely connected, while others are distant from each other. The dependency relation tree constructed by syntactic analysis can provide long-distance connections between words and can also build a graph structure of all entity relations in a sentence. When converting the input text into a graph structure, this paper constructs a dependency relation tree for each sentence by syntactically parsing the input sentences with the LTP tool. Dependency trees focus on the grammatical relations between words in a sentence and constrain these grammatical relations into a tree structure. In a sentence, if a word modifies another word, the modifier is called the dependent, the modified word is called the head, and the grammatical relationship between the two is called a dependency relation. In the dependency tree, arrows point from the head to the dependent, and the tree is obtained by representing all word dependencies in a sentence as directed edges. The graph convolutional neural network can then learn the syntactic information of sentences from these dependency relations. Syntactic analysis captures well the grammatical structure of a sentence and the dependencies between its words. The dependency tree obtained by parsing a given sentence is shown in Figure 2.
After obtaining the dependency relation tree, the model needs to transform the tree structure into a form that can be computed by a graph convolutional neural network. Usually, standard graph convolutional neural networks are constructed based on word dependencies and are represented by adjacency matrices. The adjacency matrix can represent the relations between vertices in the graph and also store edge information. Consider a dependency graph $G = (V, E)$, where $V$ denotes the vertices in the graph and $E$ denotes the set of edges in the graph. We use the adjacency matrix $A$ to represent the dependency graph. As shown in Equation (1), $A_{ij} = 1$ when $i = j$ or when there is a connection between nodes $i$ and $j$ in the dependency tree; otherwise, $A_{ij} = 0$:

$$A_{ij} = \begin{cases} 1, & i = j \ \text{or} \ e_{ij} \in E \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $e_{ij}$ denotes the edge between nodes $i$ and $j$ in the dependency tree. According to Equation (1), we can transform the dependency tree of Figure 2 into an adjacency matrix, as shown in Figure 3.
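As a minimal numpy sketch of Equation (1), the matrix below has self-loops on the diagonal ($i = j$) and symmetric entries for each dependency edge; the toy four-word parse is hypothetical:

```python
import numpy as np

def adjacency_from_edges(n, edges):
    """Build the adjacency matrix of Equation (1): A[i][j] = 1 when i == j
    (self-loop) or when a dependency edge links words i and j.
    Edges are treated as undirected, as is common for GCN inputs."""
    A = np.eye(n, dtype=int)
    for head, dep in edges:
        A[head, dep] = 1
        A[dep, head] = 1
    return A

# Toy 4-word sentence: word 1 is the root of a hypothetical parse.
edges = [(1, 0), (1, 2), (2, 3)]
A = adjacency_from_edges(4, edges)
print(A)
```

The resulting symmetric 0/1 matrix is exactly the structure the GCN layer later multiplies against the node features.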
3.1.2. Pruning Operation
To construct the dependency graph, we build a dependency tree for each input sentence. Through dependency trees, the model captures long-range word relations and hidden sentence features. However, most current studies are affected by the noise in dependency trees, and excessive use of dependency information may confuse the relation classification. In particular, automatically generated dependency trees contain a lot of noise, and leaving this noise unaddressed harms both the results and the computational complexity. Therefore, it is necessary to prune the dependency trees.
Usually, pruning operations resolve noise with strategies such as lowest common ancestor subtree pruning (forming a subtree rooted at the common ancestor node closest to the two entities) and shortest dependency path tree pruning (preserving the shortest path between the two entities in the dependency tree). In the relation extraction task, the dependency structure of sentences contains rich information, but the complete dependency tree also contains redundant information. Useless information in the dependency tree can interfere with the model, yet pruning the dependency tree may discard some important information in the sentence. In order to remove nodes carrying irrelevant information while still making effective use of the important information in sentences, a new pruning strategy is proposed in this paper.
Although the lowest common ancestor subtree and shortest dependency path methods can remove some useless nodes, some dependency information, possibly even critical information, may be lost during pruning. As shown in Figure 4, we propose combining a local subtree and the shortest path tree to construct the input graph. The local subtree contains all the dependencies directly connected to the head entity and the tail entity. The shortest path tree contains all the dependencies on the shortest path between the two entities. In a complete dependency tree, the path from the root node through the fewest nodes to the head and tail entity nodes is the shortest path. The shortest dependency path can effectively represent the structure of semantic relationships between entities, and it contains the lexical information on the path between the root node and the head and tail entity nodes. In the pruning operation, we retain all node relations contained in the local subtree and all node relations contained in the shortest path tree. Words removed by both pruning operations are treated as noise, while words retained by either pruning operation are actually kept. The dependency tree is pruned into two subtrees by the two pruning operations, and the final pruned tree is then formed from the node dependencies retained by the two subtrees. When transforming the pruned dependency tree into an adjacency matrix, the corresponding adjacency matrix value is set to 1 for the retained nodes and 0 for the deleted nodes. As shown in Figure 5, our final dependency relation graph is composed of two different dependency relation graphs, which reduces noise while retaining valid information.
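The combined pruning can be sketched in pure Python as follows, under the simplifying assumptions that each entity occupies a single node and that the dependency tree is traversed as an undirected graph; the five-word parse is hypothetical:

```python
from collections import deque

def shortest_path(n, edges, src, dst):
    """BFS over the undirected dependency tree: node sequence from src to dst."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def pruned_edges(n, edges, head, tail):
    """Keep an edge iff it lies on the shortest head-tail path (shortest path
    tree) or touches the head or tail entity node (local subtree); all other
    edges are treated as noise and removed."""
    path = shortest_path(n, edges, head, tail)
    on_path = set(zip(path, path[1:])) | set(zip(path[1:], path))
    return [(a, b) for a, b in edges
            if (a, b) in on_path or a in (head, tail) or b in (head, tail)]

edges = [(1, 0), (1, 2), (2, 3), (2, 4)]       # hypothetical 5-word parse
print(pruned_edges(5, edges, head=0, tail=3))  # edge (2, 4) is pruned away
```

The surviving edges are then mapped to 1-entries of the adjacency matrix exactly as in the unpruned case.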
3.2. BiLSTM Layer
BiLSTM can acquire contextual features; in order to fully learn the sentence information, this paper uses the BiLSTM layer to encode and model the sentences. The input of the BiLSTM layer is composed of the character vector representation $x^{c}$, the lexical feature representation $x^{p}$, and the type feature representation $x^{t}$. In the relation extraction task, a sentence may contain multiple entity types, and the type feature representation helps the model identify the target entities more accurately. For an input sentence $S = \{w_1, w_2, \ldots, w_n\}$, the input of BiLSTM is the concatenation of the three feature vectors, as shown in Equation (2):

$$x_i = [x_i^{c} ; x_i^{p} ; x_i^{t}] \tag{2}$$

For a joint embedding vector $x_i$ at any position $i$ in the input sequence, the LSTM combines $x_i$ with the state $h_{i-1}$ from the previous moment to calculate the hidden state $h_i$ at the current moment. BiLSTM can effectively memorize context information by maintaining two independent hidden layers. Finally, the forward representation $\overrightarrow{h_i}$ and backward representation $\overleftarrow{h_i}$ of any input $x_i$ are calculated to obtain the final hidden state at moment $i$, as shown in Equation (3):

$$h_i = [\overrightarrow{h_i} ; \overleftarrow{h_i}] \tag{3}$$

The hidden state $h_i$ contains the forward information of the sentence, the backward information of the sentence, and the current input $x_i$. Thus, BiLSTM can better learn bidirectional semantic information.
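The bidirectional encoding can be sketched from scratch in numpy; the gate layout below is the standard LSTM formulation, and the dimensions and random weights are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: combine the current input x with the previous
    hidden state h and cell state c."""
    H = h.shape[0]
    z = W @ x + U @ h + b          # stacked pre-activations for the 4 gates
    i = sigmoid(z[:H])             # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:])         # candidate cell state
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def bilstm(xs, Wf, Uf, bf, Wb, Ub, bb, H):
    """Encode the sequence once left-to-right and once right-to-left,
    then concatenate the two hidden states at every position."""
    n = len(xs)
    fwd, bwd = [], [None] * n
    h, c = np.zeros(H), np.zeros(H)
    for x in xs:                               # forward pass
        h, c = lstm_step(x, h, c, Wf, Uf, bf)
        fwd.append(h)
    h, c = np.zeros(H), np.zeros(H)
    for t in range(n - 1, -1, -1):             # backward pass
        h, c = lstm_step(xs[t], h, c, Wb, Ub, bb)
        bwd[t] = h
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
D, H, n = 6, 4, 5                  # input dim, hidden dim, sentence length
xs = [rng.standard_normal(D) for _ in range(n)]
params = lambda: (rng.standard_normal((4 * H, D)) * 0.1,
                  rng.standard_normal((4 * H, H)) * 0.1,
                  np.zeros(4 * H))
hs = bilstm(xs, *params(), *params(), H)
print(len(hs), hs[0].shape)        # n hidden states, each 2H-dimensional
```

Each output vector concatenates the forward and backward states, so every position carries both left and right context.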
3.3. GCN Layer
GCN is a simple and effective graph-based convolutional neural network that learns, for each node, information from all of its neighboring nodes and from the node itself, as shown in Figure 6. GCN acts directly on the graph, and its inputs are the graph structure and the feature representations of the nodes in the graph. The model proposed in this paper learns the dependency information in the dependency relation tree of the input sentence through a graph convolutional neural network.
Firstly, the sentences are preprocessed, and the feature vectors obtained by encoding the word segmentation through the BiLSTM layer are used as nodes in the graph. The relations between different nodes in the results of the dependency analysis are then used as edges that constitute the graph structure of the graph convolutional neural network. To reduce the effect of noise in the sentences, the model uses a pruning strategy on the dependency tree and transforms the pruned dependency graph into the adjacency matrix A.
Based on the adjacency matrix $A$, for each node $v_i$, the GCN at layer $l$ learns the node information on the dependency relation tree and calculates the output of node $v_i$ at layer $l$ as $h_i^{(l)}$. The specific calculation is shown in Equations (4) and (5):

$$h_i^{(0)} = x_i \tag{4}$$

$$h_i^{(l)} = \sigma\left(\sum_{j=1}^{n} A_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\right) \tag{5}$$

where $h_i^{(0)}$ represents the initial embedding of node $v_i$, $x_i$ represents the original feature of node $v_i$, $h_j^{(l-1)}$ represents the hidden state of node $v_j$ after the $(l-1)$-th graph convolutional layer, $W^{(l)}$ represents the weight matrix, $b^{(l)}$ represents the bias, $A_{ij}$ represents the element for nodes $i$ and $j$ in the adjacency matrix, $\sigma$ represents a nonlinear activation function (ReLU in this model), and $h_i^{(l)}$ represents the hidden state of node $v_i$ after the $l$-th graph convolutional layer. For each layer of the GCN, the function $\sigma$, matrix $W^{(l)}$, and bias $b^{(l)}$ are shared across all nodes, which makes the number of model parameters independent of the graph size and enables the GCN model to scale well.
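One propagation step of Equations (4) and (5) can be written in matrix form as a short numpy sketch; the chain-shaped toy graph, dimensions, and random weights are illustrative assumptions:

```python
import numpy as np

def gcn_layer(H_in, A, W, b):
    """One GCN layer in the form of Equation (5): each node aggregates the
    transformed features of its neighbours (and itself, via the self-loops
    in A) and applies a ReLU nonlinearity."""
    return np.maximum(0, A @ H_in @ W + b)

n, d_in, d_out = 4, 6, 3
rng = np.random.default_rng(1)
A = np.eye(n)                            # self-loops: A[i][i] = 1
for i, j in [(0, 1), (1, 2), (2, 3)]:    # toy chain-shaped dependency graph
    A[i, j] = A[j, i] = 1
H0 = rng.standard_normal((n, d_in))      # initial node embeddings, Equation (4)
W = rng.standard_normal((d_in, d_out))   # W and b are shared by all nodes
b = np.zeros(d_out)
H1 = gcn_layer(H0, A, W, b)
print(H1.shape)                          # one d_out-dimensional vector per node
```

Because `W` and `b` are shared across nodes, the parameter count is independent of the sentence length, matching the scalability remark above.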
3.4. Attention Layer
The self-attention mechanism can learn the internal structure of sentences and can learn the weight between every two nodes according to the correlation information between words. However, the weight vector usually obtained by the self-attention mechanism can only represent one-sided information of the sentence. Medical texts have a high entity density distribution, and for a sentence, there may be multiple aspects of semantic information that together constitute the overall information of the sentence. To be able to capture the dependency information between each node in the graph structure in a multi-dimensional way, this model uses a multi-head attention mechanism to learn the weight information between nodes from different semantic spaces.
The calculation of the attention mechanism involves a set of vectors $(Q, K, V)$. First, the current input $h$ is multiplied by three independent parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$. Then, the attention is calculated by scaled dot product, as shown in Equation (6):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{6}$$

where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively, $\sqrt{d_k}$ is used as the scaling factor, and $\text{softmax}$ normalizes the result of $QK^{T}/\sqrt{d_k}$.
The multi-head attention module contains multiple heads for parallel computation, where $N$ is the number of heads. $Q$, $K$, and $V$ are mapped $N$ times independently using different parameter matrices and then input to $N$ parallel heads for attention computation. Finally, the attention results of the $N$ heads are concatenated and linearly transformed to obtain the final output. The specific calculation process is shown in Equations (7) and (8):

$$\text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{7}$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_N)W^{O} \tag{8}$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are the parameter matrices used in the linear mappings, $\text{head}_i$ denotes the $i$-th attention head, and $\text{Concat}$ is the multi-head splicing operation.
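Equations (6)–(8) can be sketched in numpy as follows; the head count, dimensions, and random projections are illustrative assumptions (here, as in self-attention, Q, K, and V are all derived from the same input `h`):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention of Equation (6)."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(h, Wq, Wk, Wv, Wo):
    """Equations (7)-(8): project h into N independent (Q, K, V) triples,
    run the heads in parallel, concatenate, and apply the output matrix Wo."""
    heads = [attention(h @ wq, h @ wk, h @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

n, d, N, d_h = 5, 8, 2, 4               # length, model dim, heads, head dim
rng = np.random.default_rng(2)
h = rng.standard_normal((n, d))
Wq = [rng.standard_normal((d, d_h)) for _ in range(N)]
Wk = [rng.standard_normal((d, d_h)) for _ in range(N)]
Wv = [rng.standard_normal((d, d_h)) for _ in range(N)]
Wo = rng.standard_normal((N * d_h, d))
out = multi_head(h, Wq, Wk, Wv, Wo)
print(out.shape)                         # same shape as the input: (n, d)
```

Each head attends in its own projected subspace, so the concatenated output mixes weight information from several semantic spaces, as described above.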
3.5. Classification Layer
After the attention layer, the model has sufficiently learned the complete information of the sentence. The role of the relation classification layer is to classify the relations between entities based on the learned information. The model input contains entity type labels, and span mask vectors for the head and tail entities are also input to the model. In this paper, the vector representation of the sentence $h_{sent}$ is obtained by applying a maximum pooling function to the output representation $h^{att}$ of the attention layer. The same maximum pooling function is also used to obtain the entity vector representations $h_{e_1}$ and $h_{e_2}$ from $h^{att}$. The maximum pooling function is calculated as follows:

$$h_{sent} = \text{MaxPool}(h^{att}) \tag{9}$$

$$h_{e_1} = \text{MaxPool}(h^{att}[s_1]) \tag{10}$$

$$h_{e_2} = \text{MaxPool}(h^{att}[s_2]) \tag{11}$$

where $\text{MaxPool}$ denotes the maximum pooling function, and $s_1$ and $s_2$ represent the spans of the head and tail entities, respectively.
The model concatenates the vector representation of the sentence with the vector representations of the entities and passes them to a fully connected layer to obtain the final representation. Finally, the relation probability distribution of the entity pair is predicted by the $\text{softmax}$ function. The calculation is shown in Equations (12)–(14):

$$h_{final} = W[h_{sent} ; h_{e_1} ; h_{e_2}] + b \tag{12}$$

$$p(r \mid h_{final}) = \text{softmax}(h_{final}) \tag{13}$$

$$T = \arg\max_{r} \, p(r \mid h_{final}) \tag{14}$$

where $r$ denotes the relation type, $W$ is the parameter matrix, $b$ is the bias term, and $T$ denotes the final predicted label.
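The pooling and classification steps can be sketched together in numpy; the sequence length, feature dimension, number of relation types, entity spans, and random weights below are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(H_att, span1, span2, W, b):
    """Max-pool the attention output over the whole sentence and over each
    entity span, concatenate the three vectors, project them through a fully
    connected layer, and take the argmax of the softmax as the prediction."""
    h_sent = H_att.max(axis=0)                       # sentence representation
    h_e1 = H_att[span1[0]:span1[1]].max(axis=0)      # head entity representation
    h_e2 = H_att[span2[0]:span2[1]].max(axis=0)      # tail entity representation
    logits = W @ np.concatenate([h_sent, h_e1, h_e2]) + b
    probs = softmax(logits)
    return probs, int(np.argmax(probs))

n, d, r = 6, 4, 3                        # length, feature dim, relation types
rng = np.random.default_rng(3)
H_att = rng.standard_normal((n, d))      # hypothetical attention-layer output
W = rng.standard_normal((r, 3 * d))
b = np.zeros(r)
probs, label = classify(H_att, (0, 2), (4, 6), W, b)
print(probs.sum(), label)                # probabilities sum to 1
```

Max pooling over each span keeps the strongest feature activations, so the entity representations stay informative regardless of span length.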
5. Conclusions
In this paper, we propose a Chinese medical relation extraction method based on syntactic dependency structure information. Compared with previous approaches, our model learns sequence information through BiLSTM and captures the syntactic dependency information of sentences through GCN, thus learning sentence information more comprehensively. In addition, we propose a new pruning operation for the dependency trees. Finally, the model incorporates a multi-head attention mechanism to learn sentence information from different semantic spaces. The experimental results show that our BAGCN model outperforms the baseline models on the Chinese medical entity relation extraction dataset. The experiments also illustrate that syntactic dependency information is important for the relation extraction task.
However, there are still many difficulties in the task of Chinese medical entity relation extraction. Our model is still deficient in predicting complex medical entity relations. Our future work will focus more on relation extraction in more complex medical texts, such as document-level relation extraction and medical event relation extraction.