A Novel Rational Medicine Use System Based on Domain Knowledge Graph

Qin, Chaoping; Wang, Zhanxiang; Zhao, Jingran; Liu, Luyi; Xiao, Feng; Han, Yi

doi:10.3390/electronics13163156


by Chaoping Qin ¹, Zhanxiang Wang ², Jingran Zhao ², Luyi Liu ³,*, Feng Xiao ² and Yi Han ²

¹ Wuhan University of Technology Hospital, Wuhan University of Technology, Wuhan 430070, China

² School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China

³ National Science Library (Wuhan), Chinese Academy of Sciences, Wuhan 430071, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(16), 3156; https://doi.org/10.3390/electronics13163156

Submission received: 27 July 2024 / Revised: 5 August 2024 / Accepted: 8 August 2024 / Published: 9 August 2024

(This article belongs to the Section Computer Science & Engineering)


Abstract: Medication errors, which could often be detected in advance, are a significant cause of patient deaths each year, highlighting the critical importance of medication safety. The rapid advancement of data analysis technologies has made intelligent medication assistance applications possible, and these applications rely heavily on medical knowledge graphs. However, current knowledge graph construction techniques are predominantly focused on general domains, leaving a gap in specialized fields, particularly in the medical domain for medication assistance. The specialized nature of medical knowledge and the distinct distribution of vocabulary between general and biomedical texts pose challenges. Applying general natural language processing techniques directly to the medical domain often results in lower accuracy due to the inadequate utilization of contextual semantics and entity information. To address these issues and enhance knowledge graph production, this paper proposes an optimized model for named entity recognition and relationship extraction in the Chinese medical domain. The key innovations are as follows: Medical Bidirectional Encoder Representations from Transformers (MCBERT), pre-trained on Chinese biomedical corpora, provides character-level embeddings; Bi-directional Gated Recurrent Unit (BiGRU) networks extract enriched contextual features; a Conditional Random Field (CRF) layer outputs the optimal label sequence; a Piecewise Convolutional Neural Network (PCNN) captures comprehensive semantic information, which is fused with entity features for better classification accuracy; and a microservices architecture is used to implement the medication assistance review system. These enhancements significantly improve the accuracy of entity relationship classification in Chinese medical texts. The model achieved good performance in recognizing most entity types, with a precision of 88.3%, a recall of 85.8%, and an F1 score of 87.0%. In the relationship extraction stage, the precision reached 85.7%, the recall 82.5%, and the F1 score 84.0%.

Keywords: domain knowledge graph; named entity recognition; entity relationship extraction; auxiliary medication

1. Introduction

In the field of healthcare management, medication has always been a critical and challenging issue. According to the World Health Organization, the rational use of medicines involves administering medicines in a manner that ensures their safety and effectiveness, at the right time, with accurate dosages, correct usage methods, and within a proper treatment period, all at a cost affordable to the public. In practice, the large number of prescriptions, the complexity of medication regimens, and the ever-increasing variety and information on medicines can lead to prescription errors or inefficient prescription reviews due to the limitations in professional knowledge of doctors and pharmacists. Therefore, establishing a safe and reliable automatic medication review system is crucial. An excellent medication assistance system is a fundamental component of clinical rational medicine use in medical institutions. It can systematically regulate the dosage, types, and frequency of medicine use, providing pharmacists with essential medicine information to aid in the review process. The implementation of such a system relies heavily on the support of medical knowledge graphs. These graphs integrate medical data, medicine information, and treatment plans into a computer system, offering users reliable medical knowledge. They enable dosage queries and medication reviews based on actual patient conditions, potentially improving decision-making efficiency, reducing medical errors, minimizing economic losses, and thus enhancing healthcare quality. This ensures safer and more effective medical services for patients.

As the demand for specialization in current applications increases, the quality and depth of knowledge provided by general-domain knowledge graphs are insufficient to meet specialized custom requirements. Consequently, research focus is shifting from general-domain knowledge graphs to domain-specific knowledge graphs. The construction of Rational Medicine Use Systems requires knowledge from multiple domains rather than relying solely on a single subfield of biomedicine. Named entity recognition (NER) and relation extraction (RE) are two crucial steps in knowledge extraction. NER is responsible for identifying and annotating the positions and types of entities within a text, while RE identifies and determines the relationships between the recognized entities.

Named entity recognition aims to identify specific referring entities within a text. The approach involves learning features from annotated sample data to automatically recognize relevant medication entities in unannotated data, providing foundational support for applications such as information extraction and natural language processing. Currently, NER based on deep learning is a major research focus. The general process is as follows: a word-embedding layer converts natural language into indices recognizable by computers, then features are extracted, and finally, the NER task is transformed into a label classification task to output the probabilities of entity labels in sentences. Presently, the word vectors generated in research are based on general domains and have not been adapted to the medical domain. Medical entities are more complex and specialized compared to ordinary entities, leading to poor performance of existing models in recognizing medical entities. To address issues such as the unclear boundary identification of medical entities and ineffective utilization of contextual information in existing models, this paper constructs a Chinese medical domain NER model. This model enhances the overall recognition performance, laying the groundwork for subsequent relation extraction research.

Relation extraction is another crucial component of knowledge extraction, which is tasked with identifying relationships between entities in vast amounts of natural language data to construct entity–relationship–entity links. After identifying entities, determining the specific relationships between them often involves classification tasks in natural language processing (NLP). Given two entities, the task is to predict the type of relationship between them. Currently, many methods for relation extraction use machine learning or neural networks. Medical texts are characterized by context-related information, numerous terms, and uneven entity positions. Despite the structured terminology and standardized language of medical texts, existing research has not specifically designed extraction structures for relation extraction, which may impact the final determination of entity relationships. Therefore, relation extraction in the medical domain requires further research and adjustments to existing methods to fully consider the peculiarities of medical text data. To address the issue of inadequate information utilization in medical domain relation extraction, this paper integrates various features of characters, entities, and texts. By fully utilizing the relative positional information and entity category information, combined with contextual information, the model’s effectiveness in extracting relationships from the text is enhanced.

This paper proposes a named entity recognition (NER) method and an entity relation extraction (RE) method in the Chinese medical domain. The goal is to enhance the efficiency and quality of knowledge graph production. The main contributions of this paper are summarized as follows:

(1) This study proposes a named entity recognition (NER) model for the Chinese medical domain. The model utilizes MCBERT for character-level embedding pre-training on the corpus, employs a BiGRU network for contextual feature extraction, and integrates a CRF layer to optimize label sequence output. It specifically addresses the shortcomings of traditional methods, which struggle with the rigorous nature, uneven knowledge distribution, and high contextual relevance of Chinese medical texts. The model accurately identifies various entity categories from Chinese medical texts.

(2) This study also presents a relation extraction model for the Chinese medical domain, proposing an FF-PCNN structure to capture comprehensive semantic information. It effectively addresses the challenges posed by the uneven distribution of entity occurrences and complex contextual relationships in existing methods, which often underutilize character relative position and entity information. This model enhances the accuracy of entity relation classification in Chinese medical texts, effectively capturing relationships by utilizing both positional and textual information between entities.

(3) A Rational Medicine Use System using microservice architecture was developed, employing the proposed NER and relation extraction models to construct a medical knowledge graph. This system can perform tasks such as medication recommendations and prescription checks.

2. Related Work

A knowledge graph is a semantic network organized in the form of graph data, consisting of entity nodes and the relationships between them. Its core idea is to categorize information according to concepts and construct a rich knowledge system through the relationships among entities. Concepts correspond to nodes in the knowledge graph, while relationships represent the connections between entities. Attributes and instances further enrich the information content of each node.

In the early days, knowledge graphs were commonly built by manually integrating information resources, as in WordNet [1] and OpenCyc [2]. Today, the construction of knowledge graphs primarily relies on internet resources and utilizes technologies like NLP, information extraction, and machine learning to achieve semi-automated or automated construction. Prominent general-domain knowledge graphs include Wikidata [3], YAGO [4], DBpedia [5], and Freebase [6]. In the Chinese language domain, there are mature knowledge graph projects, such as SSCO [7], Zhishi.me [8], and CN-Probase [9].

Domain-specific knowledge graphs differ from general knowledge graphs in terms of data sources. Domain-specific knowledge graphs are primarily derived from industry data within a particular field and typically contain specialized knowledge. In contrast, general knowledge graphs mainly use forum or encyclopedia data as sources, resulting in more universally applicable knowledge. The precision requirements for these two types of knowledge graphs also vary. Domain-specific knowledge graphs generally require support for domain-specific decision making and analysis. Domain-specific knowledge graphs can be viewed as a branch of general knowledge graphs [10]. Notable research in related fields includes the Unified Medical Language System developed by the Institute of Medical Information, Chinese Academy of Medical Sciences [11], the knowledge graph proposed for Traditional Chinese Medicine by the China Academy of Chinese Medical Sciences [12], Miao et al.’s successful construction of a knowledge graph for respiratory diseases [13], and Geleta et al.’s [14] development of a biological insights knowledge graph using data from sources such as OpenTargets [15].

Initially, researchers addressed the NER task by using word formation rules, punctuation, and dictionaries to create templates for text matching [16]. Later, machine learning techniques were introduced, such as using Support Vector Machines (SVM) [17] to predict characters within sentences, although the results were not as expected. Subsequently, the Conditional Random Fields (CRF) model [18] was proposed, which considers global information within the text data, providing advantages in handling complex sequence data. In recent years, Huang et al. [19] were the first to apply the bidirectional LSTM-CRF model to benchmark sequence labeling datasets in NLP. Song et al. [20] proposed a novel supervised learning method that introduced a multidimensional self-attention mechanism to assess the importance of context for the current word. This advancement allowed subsequent CNN models to better capture long-term dependencies within sentences, and the method was eventually applied to biomedical named entity recognition tasks.

Previously, Guo Jianyi et al. [21] proposed a multi-kernel fusion method for entity relation extraction in the Chinese domain. Tian et al. [22] introduced a relation extraction method based on an attention mechanism and graph neural convolutional networks. The research on joint models has increased, with Yuan et al. [23] developing a specific Relation-Aware Self-Attention Network (RSAN) to improve the relation extraction performance. RSAN uses a relation-aware attention mechanism to construct specific sentence representations for each relation, which are followed by sequence labeling to extract the corresponding head and tail entities. In the field of medical entity relation extraction, Dou et al. [24] utilized recurrent neural networks and convolutional neural networks for feature extraction from positional embeddings and external key text information, resulting in a medicine interaction entity extraction model. Ding et al. [25] combined attention mechanisms with bidirectional Long Short-Term Memory networks for relation extraction in the medical domain, aiming to address the issue of data scarcity in Chinese biomedical entity relation extraction.

As the demand for specialization in applications increases, general-domain knowledge graphs fall short in quality and depth to meet specialized customization needs. In implementing medication assistance, relying solely on a single subfield of biomedicine is clearly insufficient. Knowledge from multiple domains is required, yet current research often focuses on specific fields, such as traditional Chinese medicine or cardiology, and the functionalities needed for medication assistance are still lacking in existing studies. To address these issues, this paper aims to design a comprehensive ontology for the medical auxiliary medication domain and construct an associated knowledge graph based on this ontology. Through the design of the ontology and the construction of the knowledge graph, the goal is to overcome the limitations of current understanding of the ever-increasing variety of medicines, allowing the knowledge graph to more comprehensively cover various medical fields and ultimately focus on applications related to medication assistance.

Among various NER models, MCBERT-CRF [26] passes the feature vectors of Chinese medical texts directly to a linear layer to reduce dimensions to the named entity label level, applying CRF for sequence labeling. The NER With Dic [27] model incorporates dictionary information into Chinese NER by merging word dictionaries with character-level representations, simplifying the sequence processing framework, and improving recognition accuracy and inference speed. The BERT-BiGRU-CRF [28] model uses BERT to construct character-level vectors, employs bidirectional GRU for deep feature extraction, and utilizes a CRF layer for sequence labeling. In this paper, the above three models are compared against the proposed NER model.

For relation extraction, the Unire [29] model establishes separate label spaces for entity detection and relation classification, introduces a shared label space to enhance task interaction, and designs an approximate joint decoding algorithm to output the final entities and relations. The PURE [30] model proposes a simplified extraction approach that only uses the entity model as input for the relation model, sacrificing some accuracy for improved training efficiency. The two models are compared with the proposed model in the following relation extraction experiments.

3. Overall System Design

The system architecture is illustrated in Figure 1. Initially, web crawlers are used to gather medicine-related information. The data are then screened, filtered, and cleaned according to predefined methods to construct a medicine information database. The collected information undergoes text recognition, and text annotation and relation extraction are performed according to the methods proposed in this paper. This process completes the construction of a medical knowledge graph, which provides the foundation for developing a medication assistance system.

3.1. Access to Medical Data

Many medical websites have sections dedicated to medicine instructions, where most of the data are in semi-structured or unstructured formats. Subsequent entity and relation extraction primarily relies on these data. After investigation, it was found that the medicine instruction data on the YaoRongYun website (www.pharnexcloud.com (accessed on 7 August 2024)) are relatively complete and well organized. Some entries are duplicated, but the duplicates can be removed after collection, so the site is a suitable data source.

After collecting data, the candidate entries are subjected to data cleaning. The primary focus is on the names and contents of the entries. Entries with all data fields empty after traversal are discarded. Based on the filtered candidate entries, as shown in Figure 2, we obtained a dataset consisting of 75,362 medicine instruction documents, which have undergone deduplication, filtering, and cleaning processes.
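The cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names `name` and `content` and the function `clean_entries` are hypothetical, and exact-match deduplication after whitespace normalization is an assumption.

```python
from typing import Dict, Iterable, List, Optional


def clean_entries(entries: Iterable[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Drop entries whose fields are all empty, then remove exact duplicates.

    The field names are hypothetical; the paper only states that the names
    and contents of the entries are the primary focus of cleaning.
    """
    seen = set()
    cleaned = []
    for entry in entries:
        # Normalize whitespace so trivially different copies compare equal.
        normalized = {key: (value or "").strip() for key, value in entry.items()}
        if not any(normalized.values()):
            continue  # every field is empty: discard the entry
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue  # duplicate of an entry that was already kept
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


raw = [
    {"name": "安喘片", "content": "用于咳嗽"},
    {"name": "安喘片 ", "content": "用于咳嗽"},  # duplicate up to whitespace
    {"name": "", "content": " "},                # all fields empty
]
kept = clean_entries(raw)
```

In this toy input, only the first entry survives: the second is a duplicate and the third has no content.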

The primary basis for constructing the medical dataset is the medical knowledge graph framework. Specific information from medicine instruction manuals is selected for entity recognition and relation mapping. After the annotation work is completed, the data are organized to form a comprehensive dataset. The data annotation process involves two main components: entity recognition and relation mapping. Entity recognition entails identifying and labeling terms with domain-specific meanings in the text and classifying them into corresponding entity types. Relation mapping aims to identify the interactions between entities within the text and represent them in a triplet format (subject, relation, object), where the relation defines the connection between two entities. Data are annotated based on the established concepts and relationships. The specific structure is shown in Figure 3 and Figure 4.

Referencing the entity annotation standards for medical texts by Zhang et al. [31] and considering the requirements for medication assistance, this paper defines eight major categories for entities in the corpus. For entity annotation, the “BIO” tagging scheme is adopted: “B” denotes that the character is at the beginning of an entity, “I” signifies that it is in the middle or at the end of an entity, and “O” indicates that the character is not part of an entity. The data in the test set were annotated according to this scheme, resulting in a final dataset of 3215 annotated samples.

In the entity label statistics shown in Table 1, the label “people” is split into its subclasses during the actual recognition process, such as “elderly”, “children”, and “pregnant women”. The entity labels are B-OLD and I-OLD for “elderly”, B-KID and I-KID for “children”, and B-PRG and I-PRG for “pregnant women”.
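As an illustration of the BIO scheme, the following sketch converts character-level entity spans into a tag sequence. The function `bio_tags`, the inclusive `(begin, end, label)` span format, and the example sentence are assumptions made for this illustration; only the label names B-OLD and I-OLD come from the text.

```python
from typing import List, Tuple


def bio_tags(text: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per character.

    Each entity is (begin, end, label) with inclusive character indices.
    """
    tags = ["O"] * len(text)
    for begin, end, label in entities:
        tags[begin] = f"B-{label}"          # first character of the entity
        for i in range(begin + 1, end + 1):
            tags[i] = f"I-{label}"          # middle or last character
    return tags


# Hypothetical sentence: "老年人慎用" (use with caution in the elderly).
sentence = "老年人慎用"
tags = bio_tags(sentence, [(0, 2, "OLD")])
for char, tag in zip(sentence, tags):
    print(char, tag)
```

Here the three characters of “老年人” receive B-OLD, I-OLD, and I-OLD, and the remaining two characters receive O.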

In this paper, the dataset used does not exhibit class imbalance issues. However, when using other datasets, techniques such as Synthetic Minority Over-sampling Technique (SMOTE) and weighted loss functions can be employed to perform preprocessing for imbalanced training data:

Synthetic Minority Over-sampling Technique (SMOTE): It addresses class imbalances by generating new minority class samples through interpolation between existing minority samples. This approach effectively increases the number of minority class samples, enhancing the model’s ability to learn from these classes while avoiding overfitting. However, the generated synthetic samples might introduce noise, and the computational complexity of nearest neighbor searches can be high.

Weighted Loss Function: This method assigns higher weights to minority class samples during training, compelling the model to pay more attention to these samples. It is straightforward to implement and does not require altering the data distribution, making it directly applicable to existing models and loss functions. Nonetheless, the selection of appropriate weights is critical; improper weighting can lead to instability during training, and the method may be less effective for datasets with extreme imbalance.
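Both ideas can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a procedure used in this paper: the inverse-frequency weighting formula is one common choice among several, and the SMOTE step shows only the interpolation between two minority samples, omitting the nearest-neighbor search that selects the second sample. The function names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Class weight = total / (number_of_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def smote_interpolate(a: List[float], b: List[float], rng: random.Random) -> List[float]:
    """Create one synthetic sample on the segment between minority samples a and b.

    In full SMOTE, b is chosen among the k nearest minority neighbors of a;
    that search is omitted here.
    """
    lam = rng.random()  # interpolation factor in [0, 1)
    return [x + lam * (y - x) for x, y in zip(a, b)]


weights = inverse_frequency_weights(["O", "O", "O", "B-OLD"])
synthetic = smote_interpolate([0.0, 0.0], [1.0, 2.0], random.Random(0))
```

The weights can be passed to a weighted cross-entropy loss, while the synthetic samples are appended to the minority class before training.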

3.2. Pharmaceutical Entity Extraction Based on MCBERT-BiGRU-CRF Modeling

Entity extraction is crucial for identifying different entity categories within a text, which is essential for completing tasks such as information extraction, knowledge graph construction, and question-answering systems in Natural Language Processing.

This section addresses the characteristics of high specialization and contextual relevance in the Chinese medical domain by proposing a new model, MCB-CRF. This model is based on BERT word embeddings and is specifically fine-tuned for the Chinese medical corpus. Its structure is illustrated in Figure 5.

Character Embedding Layer: Biomedical texts possess unique characteristics compared to general texts, such as specialized medical terminology, diverse vocabulary sources, and precise expression. In the Chinese text, there are often no clear boundaries between words, and the use of punctuation may vary in meaning. The MC-BERT model, designed specifically for Chinese medical texts, is based on the BERT architecture and uses characters as input to the model. This design allows the model to directly incorporate medical entity knowledge, enabling it to learn the semantics of words after obtaining word embeddings, enhancing performance in downstream tasks such as NER.

To input data into the network in tensor form, samples of varying lengths within a batch need to be padded or truncated to a uniform preset maximum length $m$. The unified text is then tokenized using a tokenizer to obtain the ID representations of characters in the dictionary. These IDs are subsequently fed into the embedding layer to obtain character-level embeddings, as indicated in (1) and (2).

$$\mathrm{input\_ids} = \mathrm{Tokenizer}(S) = (b_{1}, b_{2}, b_{n}, \dots, b_{m+2}) \tag{1}$$

$$\mathrm{logits\_out} = \mathrm{MCBERT}(\mathrm{input\_ids}) \tag{2}$$

The number of padding positions is the maximum text length minus the current text length. The dimension of the final output vector of (2) is $m \times d_{BERT}$, where $d_{BERT}$ is taken as 768 in the pre-trained model.
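The padding and truncation step can be sketched as follows. This is a simplified stand-in for a real BERT tokenizer: the toy vocabulary, the function `encode`, and the placement of the special tokens `[CLS]`, `[SEP]`, and `[PAD]` follow the usual BERT convention and are assumptions, since the paper does not list them. The two special tokens account for the length $m + 2$ in (1).

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real MC-BERT vocabulary is far larger.
toy_vocab: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "安": 4, "喘": 5, "片": 6,
}


def encode(sentence: str, vocab: Dict[str, int], m: int) -> List[int]:
    """Map a sentence to exactly m + 2 character IDs."""
    chars = list(sentence)[:m]              # truncate to at most m characters
    n_pad = m - len(chars)                  # maximum length minus current length
    tokens = ["[CLS]"] + chars + ["[SEP]"] + ["[PAD]"] * n_pad
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


short_ids = encode("安喘片", toy_vocab, m=5)          # padded
long_ids = encode("安喘片治疗咳嗽", toy_vocab, m=5)   # truncated, unknown chars -> [UNK]
```

Both calls return sequences of the same length, so the samples can be stacked into one batch tensor.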

Feature extraction layer: In Chinese medical texts, changes in word order or in the surrounding context alter the meaning of a sentence. Contextual semantic features should therefore be taken into account so that medical knowledge can be captured at a finer granularity.

The Gated Recurrent Unit (GRU) is a simplified version of the Long Short-Term Memory (LSTM) network that effectively addresses the issue of long-distance dependencies in sequences. GRU merges the forget gate and the input gate of LSTM into a single update gate and simplifies the cell state. The reset gate $r_{t}$ regulates the combination of the current input $x_{t}$ and the hidden state output $h_{t-1}$ from the previous time step, as calculated by (3). The update gate combines the functionalities of the forget gate and input gate in the LSTM structure, determining whether to retain information from the previous time step or incorporate information from the current time step. The relevant calculation is shown in (4).

$$r_{t} = \sigma(W_{r} \cdot [h_{t-1}, x_{t}] + b_{r}) \tag{3}$$

$$z_{t} = \sigma(W_{z} \cdot [h_{t-1}, x_{t}] + b_{z}) \tag{4}$$

$W_{r}$ represents the weight matrix required to be trained for the reset gate, $b_{r}$ refers to the bias term of the reset gate, $W_{z}$ is the weight matrix required to be trained for the update gate, and $b_{z}$ is the bias term of the update gate. The output of GRU is determined by (5), where $\tilde{h}_{t}$ refers to the candidate information of the hidden layer at the current time point, and its calculation method is given by (6).

$$h_{t} = (1 - z_{t}) \times h_{t-1} + z_{t} \times \tilde{h}_{t} \tag{5}$$

$$\tilde{h}_{t} = \tanh(W_{\tilde{h}} \cdot [r_{t} \times h_{t-1}, x_{t}] + b_{\tilde{h}}) \tag{6}$$
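A single GRU time step following (3) to (6) can be written in plain Python as below. This is a didactic sketch, not the trained network: the helper names and the list-based vector representation are assumptions, and a practical implementation would use a tensor library.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def _affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute W · v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def _sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x_t: Vector, h_prev: Vector,
             W_r: Matrix, b_r: Vector,
             W_z: Matrix, b_z: Vector,
             W_h: Matrix, b_h: Vector) -> Vector:
    """One GRU step; each W has shape hidden x (hidden + input)."""
    concat = h_prev + x_t                                    # [h_{t-1}, x_t]
    r = _sigmoid(_affine(W_r, concat, b_r))                  # reset gate, (3)
    z = _sigmoid(_affine(W_z, concat, b_z))                  # update gate, (4)
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t     # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in _affine(W_h, gated, b_h)]  # candidate, (6)
    # Interpolate between the previous state and the candidate, (5).
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# With all-zero parameters, both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous state.
zeros_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]
h_next = gru_step([1.0], [0.4, -0.2], zeros_W, zeros_b, zeros_W, zeros_b, zeros_W, zeros_b)
```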

This study introduced a bidirectional GRU (BiGRU) unit in the model design to extract the context feature vector $H$ from the feature vector $X$ of the Chinese medical text, as described in (7). The outputs of the forward GRU module $\overrightarrow{GRU}$ and the backward GRU module $\overleftarrow{GRU}$ are concatenated to obtain the output of the BiGRU module.

$$H = \mathrm{BiGRU}(X) = [\overrightarrow{GRU}(X), \overleftarrow{GRU}(X)] = (h_{1}, h_{2}, \dots, h_{m}) \tag{7}$$
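The concatenation in (7) can be sketched with a generic recurrent step function. The names `run` and `bidirectional` are hypothetical, and a toy running-sum step stands in for the GRU cell so that the data flow is easy to follow: the backward pass reads the sequence from right to left, and its outputs are reversed again so that each position is paired with the matching forward state.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (x_t, h_prev) -> h_t


def run(step: Step, xs: List[Vector], h0: Vector) -> List[Vector]:
    """Apply a recurrent step over a sequence and collect every hidden state."""
    h, outputs = h0, []
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs


def bidirectional(fwd: Step, bwd: Step, xs: List[Vector], hidden: int) -> List[Vector]:
    """Concatenate forward and backward hidden states position by position."""
    h0 = [0.0] * hidden
    forward_out = run(fwd, xs, h0)
    # Process the reversed sequence, then restore the original order.
    backward_out = run(bwd, xs[::-1], h0)[::-1]
    return [f + b for f, b in zip(forward_out, backward_out)]


def running_sum(x: Vector, h: Vector) -> Vector:
    """Toy stand-in for a GRU cell: accumulates the inputs."""
    return [h[0] + x[0]]


H = bidirectional(running_sum, running_sum, [[1.0], [2.0], [3.0]], hidden=1)
```

Each output vector has twice the hidden size: the first half summarizes the left context and the second half the right context.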

Decoding layer: For labeling medical texts, the sequence labeling task for characters predicts the label for each character. We need to determine the effectiveness and reasonableness of the recognition based on the contextual association of the labels in medical texts. When using deep learning models for NER in medical texts, the output of feature extraction is often combined with a Conditional Random Field (CRF) layer. The CRF layer can make adjustments through its constraint mechanism to avoid errors and constrain the final output by leveraging the interdependence of labels, reducing the error rate and achieving effective label decoding, ultimately obtaining the most probable correctly labeled sequence of sentences.

When the conditional probability distribution $P(Y \mid X)$ of the state sequence $Y$ given the input observation sequence $X$ can be represented by an undirected graphical model, this distribution is regarded as a conditional random field; that is, it satisfies (8).

$$P(Y_{i} \mid X, Y_{1}, Y_{2}, \dots, Y_{i-1}, Y_{i+1}, \dots, Y_{n}) = P(Y_{i} \mid X, Y_{i-1}, Y_{i+1}) \tag{8}$$

The conditional probability $P(Y \mid X)$ is then given by (9).

$$P(Y \mid X) = \frac{1}{Z(x)} \exp\left[\sum_{i,k} \lambda_{k} f_{k}(y_{i-1}, y_{i}, x, i) + \sum_{i,l} u_{l} h_{l}(y_{i}, x, i)\right] \tag{9}$$

where $f_{k}(y_{i-1}, y_{i}, x, i)$ is the transition feature function, which measures the influence between adjacent state variables, $h_{l}(y_{i}, x, i)$ is the state feature function, which mainly measures the influence of the observation sequence on the state variable, $\lambda_{k}$ and $u_{l}$ are the weights of the corresponding feature functions, and $Z(x)$ is the normalization factor, which ensures that the expression is a valid probability. $Z(x)$ is calculated by (10).

$$Z(x) = \sum_{y} \exp\left[\sum_{i,k} \lambda_{k} f_{k}(y_{i-1}, y_{i}, x, i) + \sum_{i,l} u_{l} h_{l}(y_{i}, x, i)\right] \tag{10}$$

For a label sequence, the label at a given time is determined only by its corresponding observation sequence and the labels at adjacent time steps. In practical applications, we first calculate the set $Y_{s}$ of all possible label sequences for the observation sequence and finally output the sequence $Y^{*}$ with the largest conditional probability, as shown in (11).

$$Y^{*} = \arg\max P(Y \mid X), \quad Y \in Y_{s} \tag{11}$$
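Because the number of candidate sequences in $Y_{s}$ grows exponentially with sentence length, the maximization in (11) is usually carried out with the Viterbi dynamic programming algorithm instead of explicit enumeration. The sketch below assumes that is acceptable here; the paper does not name its decoder. Since $Z(x)$ is identical for every candidate, it suffices to maximize the unnormalized score. The emission scores play the role of the state features and the transition scores that of the transition features; all numeric values are invented for the illustration.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the label sequence with the highest total score.

    emissions[i][y] scores label y at position i;
    transitions[p][y] scores moving from label p to label y.
    """
    labels = list(emissions[0])
    score = dict(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointer = {}, {}
        for cur in labels:
            # Best previous label for ending at `cur` in this position.
            best_prev = max(labels, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[best_prev] + transitions[best_prev][cur] + emit[cur]
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: score[y])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


labels = ["B", "I", "O"]
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I"] = -100.0  # an I tag must not directly follow an O tag
emissions = [
    {"B": 0.9, "I": 0.0, "O": 1.0},
    {"B": 0.0, "I": 2.0, "O": 0.0},
]
best = viterbi(emissions, transitions)
```

Taken position by position, the emissions alone would favor the invalid sequence O, I. The transition penalty steers the decoder to B, I, which is the constraint behavior that the CRF layer contributes.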

The pseudo-code for model training is shown in Algorithm 1.

Algorithm 1. MCB-CRF named entity recognition algorithm.

Input: training set train_set, test set test_set, MCBERT model parameter table m_config, number of iterations k, batch size N

Output: entity extraction model M, precision pre, recall recall, average F1

1: Initialize with the parameters m_config of the pre-trained model
2: for epoch ← 1 to k do
3:     for batch_id ← 1 to length(train_set) / N do
4:         text, label = get(batch_id)  // Get sentence and tag information
5:         x = MCBERT(text)  // Input to the embedding layer to obtain character vectors
6:         forward_out = gru(x)  // Forward GRU output
7:         backward_out = reverse(gru(reverse(x)))  // Backward GRU output, computed over the reversed sequence
8:         gru_out = concat(forward_out, backward_out)  // Bidirectional GRU output
9:         result = crf(gru_out)  // CRF layer output
10:        loss = calculate(result, label)  // Calculate the loss and update the model parameters
11:    end for
12:    test_loss = compute_loss(test_set)  // Evaluate the loss on the test set
13:    if test_loss < loss_m then
14:        M.save(test_loss)  // Save the model that performs best on the test set
15:    end if
16:    pre, recall = M(test_set)  // Calculate evaluation indicators through the model
17:    F1 = pre * recall * 2 / (pre + recall)  // Calculate the F1 value
18: end for
19: return M, pre, recall, F1
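The evaluation in steps 16 and 17 can be illustrated with a small function. It assumes entity-level exact matching, in which a predicted entity counts as correct only when its span and type both agree with the gold annotation; the paper does not state its matching criterion, and the function name and example spans are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (begin, end, type)


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # step 17 of Algorithm 1
    return precision, recall, f1


gold = {(0, 2, "MED"), (5, 6, "DIS")}
pred = {(0, 2, "MED"), (3, 4, "DIS")}   # one correct entity, one wrong span
scores = precision_recall_f1(gold, pred)
```

Applying the same F1 formula to the precision of 88.3% and recall of 85.8% reported in the abstract gives approximately 87.0%, which agrees with the reported F1 score.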

3.3. Chinese Medical Text Relation Extraction Based on FF-PCNN Model

In relation extraction, the text sequence and the existing entity set are typically predetermined. The goal is to determine whether a relationship exists between two entities and, if so, to identify the type of the relationship. This paper leverages deep learning techniques to improve upon existing relation extraction models, proposing a fusion feature relation extraction model, FF-PCNN (Fusion Features, FF; Piecewise Convolutional Neural Network, PCNN). This model effectively utilizes the positional information between entities and the textual information surrounding them, incorporating entity type information into the features to effectively capture the relationships between entities. As shown in Figure 6, the model consists of four parts: a word vector generation layer, position vector embedding layer, PCNN layer, and output layer.

Word vector generation layer: To better understand the meaning of medical texts, the model often processes medical text by mapping each character to low-dimensional dense word vectors through a word vector generation module. This process converts high-dimensional natural language into a form that processors can understand and handle. In this paper’s task of relation extraction in medical texts, the same word vector generation scheme used for named entity recognition is employed to better adapt to the vectorization process of Chinese medical texts. To extract local features, the text is converted into character-level embedding $X_{i}^{f}$ using a pre-trained model. Its dimension $m \times d_{BERT}$ is given by the output dimension $d_{BERT}$ of the previous word embedding layer and the number of vectors $m$.

Position vector embedding layer: In the task of extracting relationships between medical entities, the positional information of characters relative to entities is crucial. The distance between characters and medical entity pairs reflects their relationship to some extent. To address issues arising from the relative position of characters and entities, this section extracts the positional features of characters relative to entities to describe this positional information. For example, in the sentence “安喘片治疗咳嗽” (Anchuan tablets treat cough), the character “治” (treat) has relative distances of 1 and −2 to the head entity “安喘片” (Anchuan tablets) and the tail entity “咳嗽” (cough), respectively. Let $c_{i}$ denote the character at position $i$ in the sentence. The position of the leftmost character of an entity in the current text is recorded as $\mathit{begin}$, the position of its rightmost character is recorded as $\mathit{end}$, and the class of the entity is denoted by $\mathit{type}$. An entity in the sentence can then be represented as $e = (\mathit{begin}, \mathit{end}, \mathit{type})$, and the position $p_{i}$ of character $c_{i}$ relative to entity $e$ is given by (12).

$$p_{i} = \begin{cases} i - \mathit{begin}, & i < \mathit{begin} \\ 0, & \mathit{begin} \leq i \leq \mathit{end} \\ i - \mathit{end}, & i > \mathit{end} \end{cases} \tag{12}$$
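The piecewise definition in (12) can be checked against the sentence example with a few lines of Python. This is only a minimal sketch: the `Entity` tuple, the function name, and the use of 0-based character indices are assumptions made for illustration, and the type labels follow Table 1.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    begin: int  # index of the entity's leftmost character (0-based, assumed)
    end: int    # index of the entity's rightmost character
    type: str   # entity class label, e.g. "DRG"


def relative_position(i: int, entity: Entity) -> int:
    """Position of character i relative to an entity, following (12)."""
    if i < entity.begin:
        return i - entity.begin   # negative: character is left of the entity
    if i > entity.end:
        return i - entity.end     # positive: character is right of the entity
    return 0                      # character lies inside the entity


sentence = "安喘片治疗咳嗽"
head = Entity(begin=0, end=2, type="DRG")   # 安喘片
tail = Entity(begin=5, end=6, type="DIS")   # 咳嗽

i = sentence.index("治")
print(relative_position(i, head), relative_position(i, tail))  # 1 -2
```

The printed pair matches the distances of 1 and −2 stated for the character “治”.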

A lookup table converts the positions of the character relative to the two entities into the corresponding vectors $v_{i}^{s}$ and $v_{i}^{o}$, which are then concatenated to obtain the relative position embedding $X_{i}^{p}$, as shown in (13):

$$X_{i}^{p} = [v_{i}^{s}, v_{i}^{o}] \tag{13}$$

The final concatenated word vector representation is shown in Figure 7. Finally, the feature representation $X_{i}$ of the input character is obtained by concatenating the character embedding $X_{i}^{f}$ and the position embedding $X_{i}^{p}$, as shown in (14).

$$X_{i} = [X_{i}^{f}, X_{i}^{p}] \tag{14}$$
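The lookup and concatenation steps of (13) and (14) might be sketched as follows. The table size, the vector dimension, the clipping of distant offsets, and the random initialization are all assumptions for illustration; in a trained model the table entries would be learned parameters.

```python
import random

MAX_DIST = 4   # hypothetical clipping distance for relative positions
POS_DIM = 2    # hypothetical size of each position vector

random.seed(0)
# One row per clipped offset in [-MAX_DIST, MAX_DIST]; random stand-in values.
lookup_table = [[random.uniform(-1, 1) for _ in range(POS_DIM)]
                for _ in range(2 * MAX_DIST + 1)]


def position_vector(offset):
    """Look up the vector for a relative position, clipping far offsets."""
    clipped = max(-MAX_DIST, min(MAX_DIST, offset))
    return lookup_table[clipped + MAX_DIST]


def input_feature(char_embedding, offset_s, offset_o):
    """X_i = [X_i^f, X_i^p] with X_i^p = [v_i^s, v_i^o], as in (13) and (14)."""
    x_p = position_vector(offset_s) + position_vector(offset_o)
    return list(char_embedding) + x_p


x_f = [0.1, 0.2, 0.3]            # stand-in for a d_BERT-dimensional embedding
x_i = input_feature(x_f, 1, -2)  # offsets of "治" in the sentence example
print(len(x_i))                  # 3 + 2 * POS_DIM = 7
```

Each character therefore contributes a vector of length $d_{BERT}$ plus twice the position vector size to the PCNN input.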

Feature extraction layer: After the previous processing steps, the character and positional vector features of the medical text are available. To better learn these two features, this section employs the PCNN model to extract the position features of entities and the contextual features around the entities. By taking the concatenation of the character relative position features and the character vectors as input, the PCNN can deeply encode and extract the contextual information of the medical text and the relative positional information between entities. This approach enhances the accuracy and efficiency of the relationship extraction task in medical texts. The input features are constructed as shown in (15), where $m$ is the number of character vectors in the sentence.

$$F = ([X_{1}^{f}, X_{1}^{p}], [X_{2}^{f}, X_{2}^{p}], \dots, [X_{m}^{f}, X_{m}^{p}]) = (f_{1}, f_{2}, \dots, f_{m}) \tag{15}$$

The PCNN model slices the output of the convolution filter of a sentence containing two entities based on the positions of the entities and then performs max-pooling on these slices. This approach effectively captures the semantic information of the sentence, improving the performance of the relationship extraction model. The schematic diagram of the model is shown in Figure 8. Assume that the sentence contains two entities $e_{1} = (\mathit{begin}_{1}, \mathit{end}_{1}, \mathit{type}_{1})$ and $e_{2} = (\mathit{begin}_{2}, \mathit{end}_{2}, \mathit{type}_{2})$ whose relation is to be extracted. As defined above, $\mathit{begin}_{1}$ and $\mathit{begin}_{2}$ represent the position of the first character of the respective entity in the text, and $\mathit{end}_{1}$ and $\mathit{end}_{2}$ represent the position of the last character of the respective entity in the text. The output feature vector $P$ of the PCNN is calculated by (16)–(19).

$$P_{a} = \mathrm{MaxPool}(\mathrm{CNN}_{a}(f_{1}, f_{2}, \dots, f_{\min(\mathit{end}_{1}, \mathit{end}_{2})})) \tag{16}$$

$$P_{b} = \mathrm{MaxPool}(\mathrm{CNN}_{b}(f_{\min(\mathit{begin}_{1}, \mathit{begin}_{2})}, \dots, f_{\max(\mathit{end}_{1}, \mathit{end}_{2})})) \tag{17}$$

$$P_{c} = \mathrm{MaxPool}(\mathrm{CNN}_{c}(f_{\max(\mathit{end}_{1}, \mathit{end}_{2})}, \dots, f_{m - 1}, f_{m})) \tag{18}$$

$$P = [P_{a}, P_{b}, P_{c}] \tag{19}$$
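The segment boundaries of (16)–(18) can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the convolution is omitted, so one shared feature sequence stands in for the outputs of $\mathrm{CNN}_{a}$, $\mathrm{CNN}_{b}$ and $\mathrm{CNN}_{c}$, and the function names and feature values are hypothetical.

```python
def max_pool(rows):
    """Column-wise maximum over a non-empty list of equal-length feature rows."""
    return [max(column) for column in zip(*rows)]


def piecewise_pool(features, e1, e2):
    """Pool the three segments of (16)-(18) and concatenate them as in (19).

    features: list of m per-position feature rows (stand-in for CNN output).
    e1, e2:   (begin, end) pairs, 1-based and inclusive as in the equations.
    """
    (begin1, end1), (begin2, end2) = e1, e2
    m = len(features)

    def rows(lo, hi):
        return features[lo - 1:hi]          # rows f_lo .. f_hi, 1-based

    p_a = max_pool(rows(1, min(end1, end2)))                      # (16)
    p_b = max_pool(rows(min(begin1, begin2), max(end1, end2)))    # (17)
    p_c = max_pool(rows(max(end1, end2), m))                      # (18)
    return p_a + p_b + p_c                                        # (19)


# Seven positions with two feature channels each (made-up numbers).
features = [[1, 0], [2, 5], [0, 1], [9, 2], [3, 3], [4, 8], [5, 6]]
print(piecewise_pool(features, (1, 3), (6, 7)))  # [2, 5, 9, 8, 5, 6]
```

Because each segment is pooled separately, the output keeps one maximum per channel and per segment rather than a single maximum over the whole sentence.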

The pseudo-code for feature extraction using PCNN is shown in Algorithm 2.

Algorithm 2. PCNN feature extraction algorithm.

Input: Number of training epochs n; batch size i; character vector concatenated with the position vector x
Output: Output vector y after piecewise convolution
1: for epoch ← 1 to n do
2:   for j ← 1 to i do
3:     x = transpose(x) // Transpose x to the input format of the convolutional layer
4:     x = conv1d(x) // Perform a one-dimensional convolution on x
5:     for k ← 0 to 2 do
6:       pool[k] = maxpool(x, mask[:, k : k + 1, :]) // Perform max-pooling on the features of segment k
7:     end for
8:     y = concat(pool[0], pool[1], pool[2]) // Concatenate the segment features to obtain the output vector
9:   end for
10: end for
11: return y

Output layer: The overall algorithm process of the relationship extraction model is shown in Algorithm 3.

Algorithm 3. FF-PCNN relation extraction model algorithm.

Input: Training set train_set, test set test_set, MCBERT model parameter table m_config, number of iterations k, batch size L
Output: Relation extraction model R, precision pre, recall recall, F1 score F1
1: Initialize with the parameters m_config of the pre-trained model
2: for epoch ← 1 to k do
3:   for batch_id ← 1 to length(train_set) / L do
4:     text, relation, type1, type2 = get(batch_id) // Get labels and entity information
5:     x = MCBERT(text) // Feed the text to the embedding layer to obtain word vectors
6:     position = get_position(text, type1, type2) // Get relative entity position information
7:     feature = concat(x, position) // Concatenate to obtain the joint feature information
8:     pcnn_out = pcnn(feature) // Obtain features through piecewise convolution
9:     type_feature = MCBERT(type1, type2) // Generate entity type vectors
10:     out = concat(type_feature, pcnn_out) // Concatenate to obtain the joint feature vector
11:     classification = softmax(out) // Predict the relation class using the fully connected layer
12:     loss = calculate(classification, relation) // Calculate the loss and update model parameters
13:   end for
14:   test_loss = compute_loss(test_set) // Evaluate the loss on the test set
15:   if test_loss < test_loss_min then
16:     R.save(test_loss) // Save the best-performing model so far
17:   end if
18:   pre, recall = R(test_set) // Calculate the evaluation metrics with the model
19:   F1 = pre * recall * 2 / (pre + recall) // Calculate the F1 value
20: end for
21: return R, pre, recall, F1

The output of the PCNN layer and the entity feature information are concatenated and fed into the output layer as a new vector. The information extracted by the feature extraction layer includes character features and the relative positions between entities, while the entity features are obtained from the entity types. The entity type features and the output of the feature extraction layer are concatenated according to (20) to obtain the final input vector $M$ of the output layer, where the type features $t_{s}$ of the subject entity and $t_{o}$ of the object entity are obtained by mapping the respective entity types through the word vector generation layer.

$$M = [P, t_{s}, t_{o}] \tag{20}$$

After merging the features output by PCNN with the entity category features, the features are further learned through the fully connected layer to obtain the output $R$. Finally, $R$ is mapped from a vector to probabilities by applying a Softmax classifier, and the relation category with the maximum probability is selected as the output. Here, $W_{R}$ is the parameter matrix to be learned, $b$ is the bias parameter to be learned, $S$ is the set of sentences processed by the model, and $p(y \mid S)$ is the predicted probability of each relation type $y$. The calculation is shown in (21) and (22).

$$R = M \cdot W_{R} + b \tag{21}$$

$$p(y \mid S) = \mathrm{Softmax}(R) \tag{22}$$
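The fusion and classification steps in (20)–(22) can be traced with a small numeric sketch. The weights, feature values, and vector sizes below are invented for illustration; the relation labels are the ones named in this paper, and storing $W_{R}$ as one weight column per relation is an implementation choice of the sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(p, t_s, t_o, w_r, b, labels):
    """Fuse features as in (20), score as in (21), normalize as in (22)."""
    m = p + t_s + t_o                                   # M = [P, t_s, t_o]
    r = [sum(x * w for x, w in zip(m, column)) + bias
         for column, bias in zip(w_r, b)]               # R = M . W_R + b
    probs = softmax(r)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


labels = ["treatment", "adverse reaction", "No Relationship"]
# Hypothetical parameters: one weight column per relation label.
w_r = [[2.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0],
       [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]

label, probs = classify([1.0, 0.0], [1.0], [0.0], w_r, b, labels)
print(label)  # treatment
```

With these made-up weights the scores are (3, 0, 0), so the first relation receives the largest probability after the Softmax.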

The FF-PCNN model proposed in this paper has demonstrated good performance in extracting relationships from Chinese medical texts. However, the interpretability of such models is relatively low. This is due to the FF-PCNN model’s multiple neural network layers, including the MCBERT embedding layer, positional embedding layer, PCNN feature extraction layer, and output layer, each containing numerous parameters and complex computations. Consequently, understanding the internal workings of the model is challenging. The model integrates character features, positional features, and entity type features to generate the final input vector. This feature fusion approach is quite complex, making it difficult to analyze which features significantly impact the model’s final output. Techniques like Local Interpretable Model-agnostic Explanations (LIME) [32] or SHapley Additive exPlanations (SHAP) [33] can be used to analyze which features have a significant impact on the model’s predictions. This helps in understanding how the model identifies relationships between entities. Additionally, visualization techniques such as heatmaps or attention mechanism maps can be employed to show which regions or features the model focuses on, aiding in understanding how the model interprets medical texts and performs relationship extraction.

4. Experimental Results

The experimental platform configuration of this article is shown in Table 2.

The specific evaluation indicators used in this experiment and their definitions and calculation methods are as follows:

Precision (P) is the ratio of the number of samples correctly predicted as positive by the model to the total number of samples predicted as positive. This metric reflects the accuracy of the model’s predictions. The calculation is shown in (23), where TP represents the number of samples that are predicted as true by the model and are actually true, and FP represents the number of samples that are predicted as true by the model but are actually false.

$$P = \frac{TP}{TP + FP} \times 100\% \tag{23}$$

Recall (R) is the proportion of actual positive samples that are correctly predicted as positive by the model. This metric indicates the model’s ability to identify true positives. The calculation is given in (24), where TP is defined as in (23), and FN represents the number of samples that are actually positive but are predicted as negative by the model.

$$R = \frac{TP}{TP + FN} \times 100\% \tag{24}$$

The F1 score is the harmonic mean of P and R, which is calculated as shown in (25).

$$F1 = \frac{2 \times P \times R}{P + R} \times 100\% \tag{25}$$
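As a quick arithmetic check of (23)–(25), the following sketch computes the three metrics from raw counts. The counts are hypothetical and chosen only to illustrate the calculation; the final line applies the harmonic mean to the precision and recall reported for the proposed NER model.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) as percentages, following (23)-(25)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return 100 * p, 100 * r, 100 * f1


# Hypothetical counts, not taken from the experiments.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"P={p:.1f}% R={r:.1f}% F1={f1:.1f}%")  # P=80.0% R=88.9% F1=84.2%

# Harmonic mean of the reported 88.3% and 85.8%.
print(round(2 * 88.3 * 85.8 / (88.3 + 85.8), 1))  # 87.0
```

The second result agrees with the F1 score of 87.0% reported for the proposed model.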

4.1. Experimental Verification of Medical Entity Extraction of MCBERT-BiGRU-CRF Model

The NER experiment in this study involves several steps. First, the required data are collected from the internet and then cleaned, filtered, and labeled as part of preprocessing. Next, the preprocessed data are used to fine-tune a pre-trained model. The semantic features of the sentence context are then extracted using a bidirectional GRU layer, and the Softmax function is employed to assign the most probable label to each character. Following this, a CRF layer is used for decoding to produce the optimal label sequence for the input. Finally, the model’s performance is evaluated on a test dataset using the relevant performance metrics.
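The CRF decoding step can be illustrated with a generic Viterbi search over per-character label scores. This is a textbook sketch and not the implementation used in the experiments: the emission and transition scores are hand-set (a trained CRF would learn the transitions), start and end transitions are omitted, and only the DRG labels from Table 1 are used.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for one sentence.

    emissions:   one dict per character mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score;
                 missing pairs are treated as forbidden.
    """
    forbidden = float("-inf")
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda q: best[q] + transitions.get((q, y), forbidden))
            step[y] = best[prev] + transitions.get((prev, y), forbidden) + scores[y]
            pointers[y] = prev
        best = step
        backpointers.append(pointers)
    last = max(labels, key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B-DRG", "I-DRG", "O"]
# All transitions score 0 except O -> I-DRG, which is left out (forbidden).
transitions = {(a, b): 0.0 for a in labels for b in labels
               if not (a == "O" and b == "I-DRG")}
emissions = [
    {"B-DRG": 1.8, "I-DRG": 0.1, "O": 2.0},
    {"B-DRG": 0.3, "I-DRG": 2.0, "O": 0.5},
    {"B-DRG": 0.1, "I-DRG": 0.2, "O": 2.0},
]
print(viterbi(emissions, transitions, labels))  # ['B-DRG', 'I-DRG', 'O']
```

Choosing the best label per character independently would give O, I-DRG, O, which is not a valid tag sequence; decoding over the whole sequence avoids this, which is the motivation for placing a CRF layer after the Softmax scores.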

The optimal experimental settings are determined by testing various dropout rates and learning rates. The model performs best when the dropout rate is set to 20% and the learning rate is set to 0.005. A comparison of different learning rates is shown in Table 3.

Table 4 shows the recognition performance of the MCB-CRF model on different entity types, using the F1 index as the evaluation criterion. The results show that the model achieves good results on most entity types.

To evaluate the effectiveness of the proposed named entity recognition (NER) model on Chinese medical texts, this section selects several methods that have shown excellent performance in the NER field in recent years for comparative testing. The initial hyperparameters of the models are uniformly set, and all training is conducted using the same dataset. The specific models are as follows:

MCBERT-CRF [26]: The feature vector X of the Chinese medical text is directly passed to a linear layer to reduce its dimension to the level of named entity labels, and then CRF is applied to perform the sequence labeling task.
NER With Dic [27]: A Chinese named entity recognition method that introduces dictionary information. This method avoids building a complex sequence processing framework by integrating word dictionaries into character-level representations. The model is characterized by its simple structure, fast inference, and cross-platform capability. In addition, the added vocabulary information helps the model determine the boundaries of named entities more accurately, which improves the recognition accuracy.
BERT-BiGRU-CRF [28]: The classic BERT model is used to construct character-level vectors, which are then fed into a bidirectional GRU for in-depth feature extraction and finally passed through a CRF layer for accurate sequence labeling.

The results obtained after training are shown in Table 5.

By analyzing the above experiments, the following conclusions can be drawn:

Comparing the results of Experiments 1 and 4 reveals that the inclusion of the BiGRU component significantly improves the model’s precision, recall, and F1 score. This indicates that the BiGRU component effectively enhances the model’s feature extraction capability, enabling a more accurate identification of entities related to the features. On the relatively small dataset used in this study, relying solely on MCBERT for feature extraction is not ideal. This is likely because the BERT model is quite large and typically requires a substantial dataset to fine-tune the parameters for optimal performance. By integrating the BiGRU model, the feature extraction capability of the model is further enhanced beyond what BERT alone can achieve.
Comparing the results of Experiments 2 and 4, it is evident that replacing the word vector encoding component with a model pre-trained on medical and biological corpora significantly improves the model’s recall rate. In Experiment 2, the granularity of words in the dictionary is relatively fine, resulting in less accurate recognition of medical texts containing entities with larger granularity. However, the named entity recognition model using MCBERT for word vector encoding can identify more named entities. Overall, MCBERT outperforms the Word2vec model in word vector encoding, indicating that the pre-trained vector model used in this paper has superior capability in representing word feature information.
Comparing the results of Experiments 3 and 4 reveals that the pre-trained model focused on the biomedical domain performs slightly better than the model based on a broad domain. This is because the MCBERT model incorporates specialized knowledge from the medical field, enhancing the accuracy of the named entity recognition task. The experimental results confirm that the proposed named entity recognition model outperforms the comparison models in processing Chinese medical texts, further validating the model’s effectiveness.

In the Chinese medical domain, the number of papers related to knowledge graphs is relatively small. Sun [34] completed entity recognition and relation extraction on Chinese medical texts. It should be noted that the datasets used to train the models are different, so some experimental results may not be directly comparable. In Sun’s NER model, each element in the sequence is converted from a character representation to a vector representation in the embedding layer through a BERT word vector lookup table and dynamically constructed medical entity feature vectors. The sequence vector representation is then input into the encoder, which utilizes a BiGRU incorporating lexical information to automatically extract features and output hidden vectors. Finally, CRF is used as a decoder to predict the entity positions and entity type labels corresponding to each character in the sequence. According to Sun’s experimental results, the accuracy is 74.18% and the recall is 84.24%.

The comparison reveals that while the two models follow a similar general approach, Sun’s model uses BERT for embedding, whereas our model employs a pre-trained bi-directional encoding method using transformers specifically adapted to a Chinese medical corpus. This adaptation results in higher efficiency and better performance.

4.2. Experimental Verification of Chinese Medical Text Relation Extraction Using the FF-PCNN Model

When the entity set $E_{S_{i}}$, the triple set $R_{S_{i}}$, and the set $P$ of entity category pairs with possible relationships are defined for a given sentence $S_{i}$, the final training data $D$ still need to be obtained by embedding the identifier into the sentence. The steps are given by (26), and the calculation of the final relationship $r$ is shown in (27).

$$D = \{(S_{i}, e_{\mathit{sub}}, e_{\mathit{obj}}, t_{\mathit{sub}}, t_{\mathit{obj}}, r) \mid (e_{\mathit{sub}}, e_{\mathit{obj}}) \in E_{S_{i}} \times E_{S_{i}}, (t_{\mathit{sub}}, t_{\mathit{obj}}) \in P, i \in [1, m]\} \tag{26}$$

$$r = \begin{cases} \mathit{rel}, & (e_{\mathit{sub}}, \mathit{rel}, e_{\mathit{obj}}) \in R_{S_{i}} \\ \text{“No Relationship”}, & (e_{\mathit{sub}}, \mathit{rel}, e_{\mathit{obj}}) \notin R_{S_{i}} \end{cases} \tag{27}$$

Here, $e_{\mathit{sub}}$ and $e_{\mathit{obj}}$ are the subject entity and object entity in the sentence, $t_{\mathit{sub}}$ and $t_{\mathit{obj}}$ are the categories of the respective entities, $m$ is the number of data items to be processed in the dataset, and $\times$ represents the direct product between sets. The subsequent experiments in this section are conducted using the processed dataset.
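The construction in (26) and (27) might look as follows for a single sentence. This is a sketch under stated assumptions: the function and variable names are hypothetical, the step of embedding the identifier into the sentence is left out, and the entity types follow Table 1.

```python
from itertools import product

NO_RELATION = "No Relationship"


def build_samples(sentence, entities, triples, type_pairs):
    """Enumerate candidate entity pairs and label them, following (26) and (27).

    entities:   dict mapping entity text -> entity type (E_S with categories).
    triples:    set of (subject, relation, object) annotated in the sentence (R_S).
    type_pairs: set of (subject_type, object_type) that may be related (P).
    """
    relation_of = {(s, o): rel for s, rel, o in triples}
    samples = []
    for e_sub, e_obj in product(entities, repeat=2):          # E_S x E_S
        t_sub, t_obj = entities[e_sub], entities[e_obj]
        if (t_sub, t_obj) not in type_pairs:                  # keep pairs in P only
            continue
        r = relation_of.get((e_sub, e_obj), NO_RELATION)      # (27)
        samples.append((sentence, e_sub, e_obj, t_sub, t_obj, r))
    return samples


sentence = "安喘片治疗咳嗽"
entities = {"安喘片": "DRG", "咳嗽": "DIS"}
triples = {("安喘片", "treatment", "咳嗽")}
type_pairs = {("DRG", "DIS")}

for sample in build_samples(sentence, entities, triples, type_pairs):
    print(sample)
```

Restricting the candidates to the category pairs in $P$ keeps the number of “No Relationship” samples manageable, since pairs whose types can never be related are not generated at all.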

The training parameters of the model in the experiment are shown in Table 6.

Table 7 presents the recognition results of FF-PCNN for various entity relations. According to the experimental results, the proposed model demonstrates good performance in identifying most entity relationships. For the entity types “medication” and “disease”, the relationships may fall into categories such as “treatment” and “adverse reaction”, which can lead to some misclassifications during the recognition process. As a result, the overall accuracy is slightly lower. However, the relationship recognition performance is generally good.

Next, the effectiveness of the proposed relationship extraction model will be evaluated. This section compares the proposed model with representative distant supervision relationship extraction methods from the perspective of deep learning through comparative experiments:

UniRE (Wang et al., 2021) [29] establishes two independent label spaces for entity detection and relation classification, and it also proposes a method for sharing label spaces to facilitate interaction between the two tasks. Additionally, it designs an approximate joint decoding algorithm to output the final extracted entities and relations.
PURE (Zhong and Chen, 2021) [30] introduces a simple pipeline extraction model that integrates entity information early in the relation model and consolidates global context representations. It uses only the entity model as input for the relation model, accepting a slight reduction in accuracy in exchange for improved training efficiency.

By replicating the above comparison algorithms using the dataset in this study, the experimental results are shown in Table 8 and Figure 9. It can be observed that the proposed model demonstrates better extraction performance compared to the benchmark models. Compared to UniRE, the FF-PCNN model has a simpler structure and achieves better feature extraction performance. This indicates the effectiveness of the proposed feature extraction approach for entity types and relative positions. Compared to PURE, the proposed model’s F1 score is higher by 1.6%. Upon analysis, while PURE also utilizes entity type information for feature concatenation, it merely concatenates entity types and character-level vector embeddings before classification. In contrast, the proposed model not only incorporates entity type information but also integrates contextual information and relative position information between the subject and object entities. This allows for a better extraction of hidden semantic information in the text, enhancing the accuracy of the relation extraction task. These results demonstrate the effectiveness of the proposed model’s design.

Sun employs a joint extraction method for relation extraction, constructing specific sentence representations for each medical relation based on an attention mechanism. This method calculates attention scores for each character under different medical relations, using the entity recognition model proposed by Sun to directly output the entity relations contained in the text. According to Sun’s results, the accuracy is 91.6% and the recall is 89.3%.

The comparison shows that the relation extraction methods used in this paper and Sun’s paper are fundamentally different. The models for named entity recognition and relation extraction in this paper are independent of each other, making the system easier to implement and more flexible while maintaining a high level of accuracy.

4.3. Rational Medicine Use System Demo

This system can conduct dosage reviews by examining the medication methods for specific populations. It assesses patient information to determine the category to which the patient belongs. Then, it compares the prescribed dosage and usage information with the data in the knowledge base for that medication to judge whether the prescription is appropriate for the patient. After the physician reviews and approves the prescription information, the corresponding prescription details are updated in the patient’s prescription information interface. By clicking the prescription information button in the patient information interface, one can view the prescription details and the corresponding disease knowledge graph. In this graph, the color of the recommended medication nodes, ranging from green to black, indicates the level of recommendation, with green being recommended and black not recommended. The prescription information page is shown in Figure 10.
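The review logic described above (determine the patient category, look up the stored range for that category, and compare the prescribed dose) can be sketched as follows. Everything in this sketch is hypothetical: the medicine name, the categories, the age threshold, and all numbers are placeholders for illustration and are neither clinical guidance nor the system’s actual rules or data model.

```python
from dataclasses import dataclass


@dataclass
class DoseRange:
    low: float    # minimum single dose (arbitrary units)
    high: float   # maximum single dose (arbitrary units)


# Hypothetical knowledge-base excerpt: medicine -> population category -> range.
KNOWLEDGE_BASE = {
    "medicine_x": {"adult": DoseRange(2.0, 4.0), "child": DoseRange(0.5, 1.0)},
}


def patient_category(age):
    """Assign a population category; the age threshold is an assumption."""
    return "child" if age < 12 else "adult"


def review_dose(medicine, dose, age):
    """Compare a prescribed dose with the stored range for the patient's category."""
    ranges = KNOWLEDGE_BASE.get(medicine, {})
    allowed = ranges.get(patient_category(age))
    if allowed is None:
        return "no rule found"
    return "appropriate" if allowed.low <= dose <= allowed.high else "needs review"


print(review_dose("medicine_x", 3.0, age=30))  # appropriate
print(review_dose("medicine_x", 3.0, age=8))   # needs review
```

In the described system, the ranges would come from the knowledge graph rather than a hard-coded dictionary, and a flagged prescription would still be reviewed by the physician.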

5. Conclusions

In recent years, with the continuous development and promotion of intelligent medical systems, the integration of auxiliary medication and computer technology has become increasingly important in addressing the needs and practical scenarios of medication usage. The research on knowledge graphs in the medical domain has thus gained significant importance. This study delves into the construction of a medical knowledge graph using natural language processing and deep learning technologies. Through the design of ontologies and the study of knowledge extraction algorithms in the medical field, a Rational Medicine Use System based on a domain-specific knowledge graph has been constructed.

To address issues such as the misrecognition of medical-specific terms and long training times in existing pipeline extraction processes, this study avoids the traditional word segmentation and embedding method. Instead, it employs a character-level embedding approach for word vector representation. This approach effectively recognizes medical entities. To address the challenges of comprehensively utilizing contextual semantic information, entity type information, and relative position information in medical texts, this study adopts a model that integrates the positional information of entities and contextual information to form a combined vector. This combined vector is further fused with entity type information to output relationship probabilities. This approach predicts relationships between entities in Chinese medical texts, providing technical support for the extraction of entity relationships in medical knowledge graphs.

In future research, several directions can be pursued to enhance and expand the current study. In this study, a pipeline approach was used for information extraction in the medical domain. Although this method shows improvement over previous models, it still suffers from the drawback of error accumulation. The experiments for named entity recognition and relation extraction were conducted independently, yet the performance of NER can influence relation extraction during the knowledge extraction process. With the growing research on joint extraction methods, future work could explore the application of joint extraction of entities and relations in the medical field for Chinese texts. This approach could potentially mitigate the error accumulation issue and enhance the overall extraction accuracy.

Author Contributions

Conceptualization, C.Q. and Y.H.; methodology, C.Q., Z.W. and J.Z.; software, Z.W., J.Z. and F.X.; validation, Z.W., J.Z. and L.L.; formal analysis, C.Q. and Z.W.; investigation, Z.W., F.X. and L.L.; resources, Z.W., F.X. and L.L.; data curation, C.Q. and Z.W.; writing—original draft preparation, C.Q., Z.W. and L.L.; writing—review and editing, J.Z. and L.L.; visualization, Z.W., J.Z. and F.X.; supervision, L.L.; project administration, Y.H.; funding acquisition, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61801341).

Data Availability Statement

The medicine instruction data used in this research are available on the YaoRongYun website at www.pharnexcloud.com (accessed on 24 June 2024). A set of data obtained from the experiment has been made available on a public Google Drive: https://drive.google.com/drive/folders/1NK-KeeSuWYy6hvQCTDsH8S2vD5OfwSoP?usp=sharing (accessed on 7 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
2. Conesa, J.; Storey, V.C.; Sugumaran, V. Usability of upper level ontologies: The case of ResearchCyc. Data Knowl. Eng. 2010, 69, 343–356.
3. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85.
4. Rebele, T.; Suchanek, F.; Hoffart, J. YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames. In Proceedings of the 15th International Semantic Web Conference, Heraklion, Crete, Greece, 3–7 June 2017.
5. Lehmann, J.; Isele, R.; Jakob, M. DBpedia – A large-scale, multilingual knowledgebase extracted from Wikipedia. Semant. Web 2015, 6, 167–195.
6. Bollacker, K.; Evans, C.; Paritosh, P. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008.
7. Hu, F.; Shao, Z.; Ruan, T. Self-supervised Chinese ontology learning from online encyclopedias. Sci. World J. 2014, 2014, 848631.
8. Niu, X.; Sun, X.; Wang, H. Zhishi.me-weaving Chinese linking open data. In Proceedings of the Semantic Web–ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, 23–27 October 2011.
9. Chen, J.; Wang, A.; Chen, J. CN-Probase: A data-driven approach for large-scale Chinese taxonomy construction. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019.
10. Liu, Y.; Li, H. Summary of domain knowledge mapping studies. Comput. Syst. Appl. 2020, 29, 1–12.
11. Li, D.; Hu, T.; Li, J. Construction and application of Chinese integrated medical language system. Intell. Mag. 2011, 30, 147–151.
12. Jia, L.; Liu, J.; Yu, P. Construction of TCM knowledge map. J. Med. Inform. 2015, 36, 51–53+59.
13. Miao, L. Research on Intelligent Question and Answer Method Based on Chinese Medicine Knowledge Graph. Master’s Thesis, Xidian University, Xi’an, China, 2022.
14. Geleta, D.; Nikolov, A.; Edwards, G. Biological Insights Knowledge Graph: An integrated knowledge graph to support medicine development. bioRxiv 2021.
15. Koscielny, G.; An, P.; Carvalho-Silva, D. Open Targets: A platform for therapeutic target identification and validation. Nucleic Acids Res. 2017, 45, D985–D994.
16. Hou, M.; Wei, R.; Lu, L.; Lan, X.; Cai, H. Research review of knowledge graph and its application in medical domain. J. Comput. Res. Dev. 2018, 55, 2587–2599.
17. Lee, K.J.; Hwang, Y.S.; Kim, S. Biomedical named entity recognition using two-phase model based on SVMs. J. Biomed. Inform. 2004, 37, 436–447.
18. Li, L.; Zhou, R.; Huang, D. Two-phase biomedical named entity recognition using CRFs. Comput. Biol. Chem. 2009, 33, 334–338.
19. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
20. Song, X.; Feng, A.; Wang, W. Multidimensional self-attention for aspect term extraction and biomedical named entity recognition. Math. Probl. Eng. 2020, 2020, 8604513.
21. Guo, J.Y.; Chen, P.; Yu, Z.T. Chinese Domain Entity Relation Extraction Based on Multi-Core Fusion. J. Chin. Inf. Process. 2016, 30, 24–29.
22. Tian, Y.; Chen, G.; Song, Y. Dependency-driven relation extraction with attentive graph convolutional networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021.
23. Yuan, Y.; Zhou, X.; Pan, S. A relation-specific attention network for joint entity and relation extraction. In Proceedings of the International Joint Conference on Artificial Intelligence, Online, 3–12 November 2021.
24. Dou, M.; Ding, J.; Chen, G. IK-DDI: A novel framework based on instance position embedding and key external text for DDI extraction. Brief. Bioinform. 2023, 24, bbad099.
25. Ding, Z.Y.; Yang, Z.H.; Luo, L. A Deep Learning-Based Chinese Biomedical Entity Relation Extraction System. J. Chin. Inf. Process. 2021, 35, 70–76.
26. Zhang, N.; Jia, Q.; Yin, K. Conceptualized Representation Learning for Chinese Biomedical Text Mining. arXiv 2020, arXiv:2008.10813.
27. Ma, R.; Peng, M.; Zhang, Q. Simplify the Usage of Lexicon in Chinese NER. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
29. Wang, Y.; Sun, C.; Wu, Y.; Zhou, H.; Li, L.; Yan, J. UniRE: A Unified Label Space for Entity Relation Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021.
30. Zhong, Z.; Chen, D. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 50–61.
31. Zhang, H.; Zong, Y.; Chang, B. Annotation Specifications for Medical Entities for Medical Text Processing (Medical Entity Annotation Standard for Medical Text Processing). In Proceedings of the 19th Chinese National Conference on Computational Linguistics, Haikou, China, 31 October–1 November 2020.
32. Zafar, M.; Khan, N. Deterministic local interpretable model-agnostic explanations for stable explainability. Mach. Learn. Knowl. Extr. 2021, 3, 525–541.
33. Lundberg, S.; Lee, S. A Unified Approach To Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
34. Sun, Y. Research on Named Entity Recognition and Entity Relationship Extraction in the Field of Chinese Medicine. Master’s Thesis, Yanshan University, Qinhuangdao, China, 2023.

Figure 1. Overall design of Rational Medicine Use System.

Figure 2. Candidate entries.

Figure 3. Knowledge graph ontology structure.

Figure 4. Knowledge graph medicine structure.

Figure 5. Schematic of MCB-CRF model structure.

Figure 6. Relationship extraction model structure diagram.

Figure 7. Character and position vector representation.

Figure 8. Schematic diagram of PCNN model.

Figure 9. Comparison of experimental results.

Figure 10. Prescribing information page.

Table 1. Entity tag statistics.

Entity Type	Label
Drug/Medicine (DRG)	B-DRG, I-DRG
Disease (DIS)	B-DIS, I-DIS
Compound (COM)	B-COM, I-COM
Advice (ADV)	B-ADV, I-ADV
Method of Administration (MET)	B-MET, I-MET
Frequency of Medication (FRE)	B-FRE, I-FRE
Amount (AMT)	B-AMT, I-AMT
People (POP)	B-POP, I-POP
Non-entity character	O
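
The labels in Table 1 follow a B-/I-/O pattern, in which a B- tag opens an entity, I- tags of the same type continue it, and O marks characters outside any entity. As a minimal sketch of how such a character-level tag sequence could be turned back into entity spans, the following Python function groups tagged characters. The function name `decode_bio` and the example sentence are illustrative assumptions and are not taken from the paper's implementation.

```python
from typing import List, Optional, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged characters into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far (if any) and reset the buffer.
        nonlocal current_chars, current_type
        if current_chars:
            entities.append(("".join(current_chars), current_type))
        current_chars, current_type = [], None

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


# Hypothetical example: "aspirin, once daily" tagged with Table 1 labels.
chars = list("阿司匹林,每日一次")
tags = ["B-DRG", "I-DRG", "I-DRG", "I-DRG", "O",
        "B-FRE", "I-FRE", "I-FRE", "I-FRE"]
print(decode_bio(chars, tags))
# [('阿司匹林', 'DRG'), ('每日一次', 'FRE')]
```

In this sketch an I- tag that does not match the currently open entity is simply discarded. A production decoder might instead treat it as the start of a new entity.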

Table 2. Experimental platform environment configuration.

Experimental Environment	Configuration
Processor	Intel(R) Core(TM) i7-13700KF
Software Environment	PyCharm 2020.2.1 x64
Memory	32 GB
Operating System	Windows 10
Python	3.7.3
PyTorch	1.6.1
CUDA	11.6.11
GPU	NVIDIA GeForce GTX 1060

Table 3. Learning rate comparison.

Learning Rate	Accuracy (%)	Recall (%)	F1 (%)
0.01	87.4	84.9	86.1
0.005	87.7	85.0	86.3
0.001	87.1	85.3	86.2
0.0001	84.1	80.9	82.4
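
As a quick consistency check on Table 3, the sketch below recomputes F1 as the harmonic mean of the first two columns and picks the learning rate with the highest reported F1. It assumes that the column labeled "Accuracy" plays the role of precision, which the paper does not state explicitly but which the arithmetic supports. The names `TABLE_3`, `f1_score`, `max_f1_gap`, and `best_learning_rate` are illustrative.

```python
# Rows copied from Table 3:
# learning rate -> ("Accuracy" column, recall, reported F1), all in percent.
TABLE_3 = {
    0.01: (87.4, 84.9, 86.1),
    0.005: (87.7, 85.0, 86.3),
    0.001: (87.1, 85.3, 86.2),
    0.0001: (84.1, 80.9, 82.4),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_gap(table) -> float:
    """Largest absolute difference between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in table.values())


def best_learning_rate(table) -> float:
    """Learning rate whose row has the highest reported F1."""
    return max(table, key=lambda lr: table[lr][2])


print(max_f1_gap(TABLE_3) < 0.1)    # True
print(best_learning_rate(TABLE_3))  # 0.005
```

Every recomputed F1 lies within 0.1 percentage points of the reported value, and the selected learning rate, 0.005, is the one listed in Table 6.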

Table 4. Entity recognition results.

Entity Name	Accuracy (%)	Recall (%)	F1 (%)
Medicine (DRG)	89.2	86.1	87.6
Disease (DIS)	87.5	90.2	88.8
Compound (COM)	88.2	86.9	87.5
Advice (ADV)	89.1	85.6	87.3
Route of Administration (MET)	86.2	83.3	84.7
Frequency of Medication (FRE)	84.2	84.2	84.2
Dosage (AMT)	85.6	83.7	84.6
Elder (OLD)	90.5	86.4	88.4
Kids (KID)	91.2	87.8	89.4
Pregnant Women (PRG)	91.3	83.8	87.3
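
The paper does not say how the per-entity results in Table 4 are aggregated into the overall figures reported for the proposed model in Table 5. One plausible reading is an unweighted (macro) average over the ten entity types. The sketch below, using the illustrative names `TABLE_4` and `macro_average`, computes that average so the two tables can be compared.

```python
from statistics import mean

# Rows copied from Table 4:
# entity label -> ("Accuracy" column, recall, F1), all in percent.
TABLE_4 = {
    "DRG": (89.2, 86.1, 87.6),
    "DIS": (87.5, 90.2, 88.8),
    "COM": (88.2, 86.9, 87.5),
    "ADV": (89.1, 85.6, 87.3),
    "MET": (86.2, 83.3, 84.7),
    "FRE": (84.2, 84.2, 84.2),
    "AMT": (85.6, 83.7, 84.6),
    "OLD": (90.5, 86.4, 88.4),
    "KID": (91.2, 87.8, 89.4),
    "PRG": (91.3, 83.8, 87.3),
}


def macro_average(rows):
    """Unweighted mean of each metric column, rounded to one decimal."""
    columns = zip(*rows.values())
    return tuple(round(mean(col), 1) for col in columns)


print(macro_average(TABLE_4))
# (88.3, 85.8, 87.0)
```

The result coincides with the 88.3 / 85.8 / 87.0 row for the proposed model in Table 5, which is consistent with macro-averaging, although it does not prove that this is how the authors computed the overall scores.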

Table 5. Experimental results.

Model	Accuracy (%)	Recall (%)	F1 (%)
MCBERT-CRF [26]	86.8	82.4	84.5
NER With Dic [27]	87.8	84.3	86.0
BERT-BiGRU-CRF [28]	87.2	83.6	85.3
Proposed MCBERT-BiGRU-CRF	88.3	85.8	87.0

Table 6. Model training parameters.

Parameter	Value
Learning rate	0.005
Batch size	64
Epoch	100
Dropout	0.2
Char_embedding	768
Type_embedding	100
Pos_embedding	50

Table 7. Entity relationship recognition results.

Relationship Category	Accuracy (%)	Recall (%)	F1 (%)
Element	87.2	85.3	86.2
Interaction(medicines)	84.3	80.1	82.1
Interaction(Medicines)	86.6	82.3	84.4
Treat	82.6	84.1	83.3
Adverse Reactions	82.1	80.6	81.3
Medication	84.9	80.7	82.7
Measurement	87.7	82.0	84.7
Frequency	88.3	84.5	86.3
Use	89.7	82.5	85.9
Methods	84.5	82.8	83.6

Table 8. Comparative experimental results.

Model	Accuracy (%)	Recall (%)	F1 (%)
UniRE [29]	84.1	80.3	82.1
PURE [30]	83.3	81.6	82.4
Proposed FF-PCNN	85.7	82.5	84.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

