1. Introduction
In the field of healthcare management, medication has always been a critical and challenging issue. According to the World Health Organization, the rational use of medicines involves administering medicines in a manner that ensures their safety and effectiveness, at the right time, with accurate dosages, correct usage methods, and within a proper treatment period, all at a cost affordable to the public. In practice, the large number of prescriptions, the complexity of medication regimens, and the ever-increasing variety and information on medicines can lead to prescription errors or inefficient prescription reviews due to the limitations in professional knowledge of doctors and pharmacists. Therefore, establishing a safe and reliable automatic medication review system is crucial. An excellent medication assistance system is a fundamental component of clinical rational medicine use in medical institutions. It can systematically regulate the dosage, types, and frequency of medicine use, providing pharmacists with essential medicine information to aid in the review process. The implementation of such a system relies heavily on the support of medical knowledge graphs. These graphs integrate medical data, medicine information, and treatment plans into a computer system, offering users reliable medical knowledge. They enable dosage queries and medication reviews based on actual patient conditions, potentially improving decision-making efficiency, reducing medical errors, minimizing economic losses, and thus enhancing healthcare quality. This ensures safer and more effective medical services for patients.
As the demand for specialization in current applications increases, the quality and depth of knowledge provided by general-domain knowledge graphs are insufficient to meet specialized custom requirements. Consequently, research focus is shifting from general-domain knowledge graphs to domain-specific knowledge graphs. The construction of Rational Medicine Use Systems requires knowledge from multiple domains rather than relying solely on a single subfield of biomedicine. Named entity recognition (NER) and relation extraction (RE) are two crucial steps in knowledge extraction. NER is responsible for identifying and annotating the positions and types of entities within a text, while RE identifies and determines the relationships between the recognized entities.
Named entity recognition aims to identify specific referring entities within a text. The approach involves learning features from annotated sample data to automatically recognize relevant medication entities in unannotated data, providing foundational support for applications such as information extraction and natural language processing. Currently, NER based on deep learning is a major research focus. The general process is as follows: a word-embedding layer converts natural language into indices recognizable by computers, then features are extracted, and finally, the NER task is transformed into a label classification task to output the probabilities of entity labels in sentences. Presently, the word vectors generated in research are based on general domains and have not been adapted to the medical domain. Medical entities are more complex and specialized compared to ordinary entities, leading to poor performance of existing models in recognizing medical entities. To address issues such as the unclear boundary identification of medical entities and ineffective utilization of contextual information in existing models, this paper constructs a Chinese medical domain NER model. This model enhances the overall recognition performance, laying the groundwork for subsequent relation extraction research.
Relation extraction is another crucial component of knowledge extraction, which is tasked with identifying relationships between entities in vast amounts of natural language data to construct entity–relationship–entity links. After identifying entities, determining the specific relationships between them often involves classification tasks in natural language processing (NLP). Given two entities, the task is to predict the type of relationship between them. Currently, many methods for relation extraction use machine learning or neural networks. Medical texts are characterized by context-related information, numerous terms, and uneven entity positions. Despite the structured terminology and standardized language of medical texts, existing research has not specifically designed extraction structures for relation extraction, which may impact the final determination of entity relationships. Therefore, relation extraction in the medical domain requires further research and adjustments to existing methods to fully consider the peculiarities of medical text data. To address the issue of inadequate information utilization in medical domain relation extraction, this paper integrates various features of characters, entities, and texts. By fully utilizing the relative positional information and entity category information, combined with contextual information, the model’s effectiveness in extracting relationships from the text is enhanced.
This paper proposes a named entity recognition (NER) method and an entity relation extraction (RE) method in the Chinese medical domain. The goal is to enhance the efficiency and quality of knowledge graph production. The main contributions of this paper are summarized as follows:
(1) This study proposes a named entity recognition (NER) model for the Chinese medical domain. The model utilizes MCBERT for character-level embedding pre-training on the corpus, employs a BiGRU network for contextual feature extraction, and integrates a CRF layer to optimize label sequence output. It specifically addresses the shortcomings of traditional methods, which struggle with the rigorous nature, uneven knowledge distribution, and high contextual relevance of Chinese medical texts. The model accurately identifies various entity categories from Chinese medical texts.
(2) This study also presents a relation extraction model for the Chinese medical domain, proposing an FF-PCNN structure to capture comprehensive semantic information. It effectively addresses the challenges posed by the uneven distribution of entity occurrences and complex contextual relationships in existing methods, which often underutilize character relative position and entity information. This model enhances the accuracy of entity relation classification in Chinese medical texts, effectively capturing relationships by utilizing both positional and textual information between entities.
(3) A Rational Medicine Use System using microservice architecture was developed, employing the proposed NER and relation extraction models to construct a medical knowledge graph. This system can perform tasks such as medication recommendations and prescription checks.
2. Related Work
A knowledge graph is a semantic network organized in the form of graph data, consisting of entity nodes and the relationships between them. Its core idea is to categorize information according to concepts and construct a rich knowledge system through the relationships among entities. Concepts correspond to nodes in the knowledge graph, while relationships represent the connections between entities. Attributes and instances further enrich the information content of each node.
In the early days, manually integrating information resources to form knowledge graphs was a common approach, as in WordNet [1] and OpenCyc [2]. Today, the construction of knowledge graphs primarily relies on internet resources and utilizes technologies like NLP, information extraction, and machine learning to achieve semi-automated or automated construction. Prominent general-domain knowledge graphs include Wikidata [3], YAGO [4], DBpedia [5], and Freebase [6]. In the Chinese language domain, there are mature knowledge graph projects, such as SSCO [7], Zhishi.me [8], and CN-Probase [9].
Domain-specific knowledge graphs differ from general knowledge graphs in terms of data sources. Domain-specific knowledge graphs are primarily derived from industry data within a particular field and typically contain specialized knowledge. In contrast, general knowledge graphs mainly use forum or encyclopedia data as sources, resulting in more universally applicable knowledge. The precision requirements for these two types of knowledge graphs also vary. Domain-specific knowledge graphs generally require support for domain-specific decision making and analysis. Domain-specific knowledge graphs can be viewed as a branch of general knowledge graphs [10]. Notable research in related fields includes the Unified Medical Language System developed by the Institute of Medical Information, Chinese Academy of Medical Sciences [11], the knowledge graph proposed for Traditional Chinese Medicine by the China Academy of Chinese Medical Sciences [12], Miao et al.’s successful construction of a knowledge graph for respiratory diseases [13], and Geleta et al.’s [14] development of a biological insights knowledge graph using data from sources such as OpenTargets [15].
Initially, researchers addressed the NER task by using word formation rules, punctuation, and dictionaries to create templates for text matching [16]. Later, machine learning techniques were introduced, such as using Support Vector Machines (SVM) [17] to predict characters within sentences, although the results were not as expected. Subsequently, the Conditional Random Fields (CRF) model [18] was proposed, which considers global information within the text data, providing advantages in handling complex sequence data. In recent years, Huang et al. [19] were the first to apply the bidirectional LSTM-CRF model to benchmark sequence labeling datasets in NLP. Song et al. [20] proposed a novel supervised learning method that introduced a multidimensional self-attention mechanism to assess the importance of context for the current word. This advancement allowed subsequent CNN models to better capture long-term dependencies within sentences, and the method was eventually applied to biomedical named entity recognition tasks.
Previously, Guo Jianyi et al. [21] proposed a multi-kernel fusion method for entity relation extraction in the Chinese domain. Tian et al. [22] introduced a relation extraction method based on an attention mechanism and graph neural convolutional networks. The research on joint models has increased, with Yuan et al. [23] developing a specific Relation-Aware Self-Attention Network (RSAN) to improve the relation extraction performance. RSAN uses a relation-aware attention mechanism to construct specific sentence representations for each relation, which are followed by sequence labeling to extract the corresponding head and tail entities. In the field of medical entity relation extraction, Dou et al. [24] utilized recurrent neural networks and convolutional neural networks for feature extraction from positional embeddings and external key text information, resulting in a medicine interaction entity extraction model. Ding et al. [25] combined attention mechanisms with bidirectional Long Short-Term Memory networks for relation extraction in the medical domain, aiming to address the issue of data scarcity in Chinese biomedical entity relation extraction.
As the demand for specialization in applications increases, general-domain knowledge graphs fall short in quality and depth to meet specialized customization needs. In implementing medication assistance, relying solely on a single subfield of biomedicine is clearly insufficient. Knowledge from multiple domains is required, yet current research often focuses on specific fields, such as traditional Chinese medicine or cardiology, and the functionalities needed for medication assistance are still lacking in existing studies. To address these issues, this paper aims to design a comprehensive ontology for the medical auxiliary medication domain and construct an associated knowledge graph based on this ontology. Through the design of the ontology and the construction of the knowledge graph, the goal is to overcome the limitations of current understanding of the ever-increasing variety of medicines, allowing the knowledge graph to more comprehensively cover various medical fields and ultimately focus on applications related to medication assistance.
Among various NER models, MCBERT-CRF [26] passes the feature vectors of Chinese medical texts directly to a linear layer to reduce dimensions to the named entity label level, applying CRF for sequence labeling. The NER With Dic [27] model incorporates dictionary information into Chinese NER by merging word dictionaries with character-level representations, simplifying the sequence processing framework, and improving recognition accuracy and inference speed. The BERT-BiGRU-CRF [28] model uses BERT to construct character-level vectors, employs bidirectional GRU for deep feature extraction, and utilizes a CRF layer for sequence labeling. In this paper, the above three models are compared against the proposed NER model.
For relation extraction, the Unire [29] model establishes separate label spaces for entity detection and relation classification, introduces a shared label space to enhance task interaction, and designs an approximate joint decoding algorithm to output the final entities and relations. The PURE [30] model proposes a simplified extraction approach that only uses the entity model as input for the relation model, sacrificing some accuracy for improved training efficiency. The two models are compared with the proposed model in the following relation extraction experiments.
3. Overall System Design
The system architecture is illustrated in Figure 1. Initially, web crawlers are used to gather medicine-related information. The data are then screened, filtered, and cleaned according to predefined methods to construct a medicine information database. The collected information undergoes text recognition, and text annotation and relation extraction are performed according to the methods proposed in this paper. This process completes the construction of a medical knowledge graph, which provides the foundation for developing a medication assistance system.
3.1. Access to Medical Data
Many medical websites have sections dedicated to medicine instructions, where most of the data are in semi-structured or unstructured formats. Subsequent entity and relation extraction primarily relies on these data. After investigation, it was found that the medicine instruction data on the YaoRongYun website (www.pharnexcloud.com (accessed on 7 August 2024)) are relatively complete and well organized, although they contain some duplication. Because duplicates can be removed after collection, the site is a suitable data source.
After collecting data, the candidate entries are subjected to data cleaning. The primary focus is on the names and contents of the entries. Entries with all data fields empty after traversal are discarded. Based on the filtered candidate entries, as shown in Figure 2, we obtained a dataset consisting of 75,362 medicine instruction documents, which have undergone deduplication, filtering, and cleaning processes.
The primary basis for constructing the medical dataset is the medical knowledge graph framework. Specific information from medicine instruction manuals is selected for entity recognition and relation mapping. After the annotation work is completed, the data are organized to form a comprehensive dataset. The data annotation process involves two main components: entity recognition and relation mapping. Entity recognition entails identifying and labeling terms with domain-specific meanings in the text and classifying them into corresponding entity types. Relation mapping aims to identify the interactions between entities within the text and represent them in a triplet format (subject, relation, object), where the relation defines the connection between two entities. Data are annotated based on the established concepts and relationships. The specific structure is shown in Figure 3 and Figure 4.
Referencing the entity annotation standards for medical texts by Zhang et al. [31] and considering the requirements for medication assistance, this paper defines eight major categories for entities in the corpus. For entity annotation, the “BIO” tagging method is adopted: “B” denotes that the character is at the beginning of an entity, “I” signifies that it is in the middle or end of an entity, and “O” indicates that the character is not part of an entity. The data in the test set were annotated according to this scheme, resulting in a final dataset of 3215 annotated samples.
In the entity label statistics table shown in Table 1, the label “people” is further specified into its subclasses during the actual recognition process. For example, the category “people” is subdivided into specific entity labels such as “elderly” and “children”. The entity labels for the “elderly” category are B-OLD and I-OLD, those for “children” are B-KID and I-KID, and those for “pregnant women” are B-PRG and I-PRG.
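The BIO scheme described above can be illustrated with a minimal sketch. The function name `spans_to_bio`, the span format, and the example sentence are assumptions made for illustration; only the label names OLD, KID, and PRG come from the text.

```python
from typing import List, Tuple

# A span is (start index, end index inclusive, entity label).
# This span format is an assumption for illustration.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-level entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start <= end < len(text)):
            raise ValueError(f"span {(start, end)} is outside the text")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"          # middle or end of the entity
    return tags


sentence = "老年人慎用"  # hypothetical example: "use with caution in the elderly"
bio_tags = spans_to_bio(sentence, [(0, 2, "OLD")])
for char, tag in zip(sentence, bio_tags):
    print(char, tag)
```

Each character receives exactly one tag, which is what allows NER to be treated as a per-character label classification task.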
In this paper, the dataset used does not exhibit class imbalance issues. However, when using other datasets, techniques such as Synthetic Minority Over-sampling Technique (SMOTE) and weighted loss functions can be employed to perform preprocessing for imbalanced training data:
Synthetic Minority Over-sampling Technique (SMOTE): It addresses class imbalances by generating new minority class samples through interpolation between existing minority samples. This approach effectively increases the number of minority class samples, enhancing the model’s ability to learn from these classes while avoiding overfitting. However, the generated synthetic samples might introduce noise, and the computational complexity of nearest neighbor searches can be high.
Weighted Loss Function: This method assigns higher weights to minority class samples during training, compelling the model to pay more attention to these samples. It is straightforward to implement and does not require altering the data distribution, making it directly applicable to existing models and loss functions. Nonetheless, the selection of appropriate weights is critical; improper weighting can lead to instability during training, and the method may be less effective for datasets with extreme imbalance.
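As a rough illustration of the weighted-loss idea, the sketch below derives per-class weights from label frequencies. The inverse-frequency rule used here (total / (number of classes × class count)) is one common heuristic and an assumption on our part; the text does not prescribe a specific weighting.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Give rarer classes larger weights: total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


# Toy tag distribution (hypothetical): "O" dominates, entity tags are rare.
tags = ["O"] * 8 + ["B-KID"] + ["I-KID"]
weights = inverse_frequency_weights(tags)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Such weights would typically be passed to the loss function so that errors on minority tags cost more; as noted above, extreme weights can destabilize training, so they are often clipped or smoothed in practice.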
3.2. Pharmaceutical Entity Extraction Based on MCBERT-BiGRU-CRF Modeling
Entity extraction is crucial for identifying different entity categories within a text, which is essential for completing tasks such as information extraction, knowledge graph construction, and question-answering systems in Natural Language Processing.
This section addresses the characteristics of high specialization and contextual relevance in the Chinese medical domain by proposing a new model, MCB-CRF. This model is based on BERT word embeddings and is specifically fine-tuned for the Chinese medical corpus. Its structure is illustrated in Figure 5.
Character embedding layer: Biomedical texts possess unique characteristics compared to general texts, such as specialized medical terminology, diverse vocabulary sources, and precise expression. In Chinese text, there are often no clear boundaries between words, and the meaning of punctuation may vary with its use. The MC-BERT model, designed specifically for Chinese medical texts, is based on the BERT architecture and takes characters as its input. This design allows the model to directly incorporate medical entity knowledge, enabling it to learn the semantics of words once the embeddings are obtained and enhancing performance in downstream tasks such as NER.
To input data into the network in tensor form, samples of varying lengths within a batch need to be padded or truncated to a uniform, preset maximum length. The unified text is then tokenized using a tokenizer to obtain the ID representations of characters in the dictionary. These IDs are subsequently fed into the embedding layer to obtain character-level embeddings, as indicated in (1) and (2). The number of padding positions is the preset maximum length minus the current text length. The final output of (2) contains one embedding vector per position, with the embedding dimension taken as 768 in the pre-trained model.
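The padding and truncation step can be sketched as follows. The toy vocabulary, the `PAD_ID` and `UNK_ID` values, and the function name `encode` are assumptions for illustration; a real BERT-style tokenizer also adds special tokens, which this sketch omits.

```python
from typing import Dict, List

PAD_ID = 0  # assumed id of the padding token
UNK_ID = 1  # assumed id for characters missing from the dictionary


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map characters to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:max_len]]  # truncate long text
    ids += [PAD_ID] * (max_len - len(ids))                  # pad short text
    return ids


toy_vocab = {"安": 2, "喘": 3, "片": 4}
padded = encode("安喘片", toy_vocab, max_len=5)
truncated = encode("安喘片治疗咳嗽", toy_vocab, max_len=4)
print(padded)
print(truncated)
```

The number of padding ids appended equals the maximum length minus the text length, matching the description above.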
Feature extraction layer: For Chinese medical texts, changes in the order or context of words affect the meaning of the text. The influence of contextual semantic features should therefore be considered in order to further subdivide the granularity of medical knowledge.
The Gated Recurrent Unit (GRU) is a simplified version of the Long Short-Term Memory (LSTM) network that effectively addresses the issue of long-distance dependencies in sequences. GRU merges the forget gate and the input gate of LSTM into a single update gate and simplifies the cell state. The reset gate $r_t$ regulates the combination of the current input $x_t$ and the hidden state output $h_{t-1}$ from the previous time step, as calculated by (3). The update gate combines the functionalities of the forget gate and input gate in the LSTM structure, determining whether to retain information from the previous time step or incorporate information from the current time step. The relevant calculation is shown in (4). $W_r$ represents the weight matrix to be trained for the reset gate, $b_r$ refers to the bias term of the reset gate, $W_z$ is the weight matrix to be trained for the update gate, and $b_z$ is the bias term of the update gate. The output of the GRU is determined by (5), where $\tilde{h}_t$ refers to the candidate information of the hidden layer at the current time step, and its calculation method is given by (6).
This study introduced a bidirectional GRU (BiGRU) unit in the model design to extract the feature vector X of the Chinese medical text and its context feature vector H, as described in (7). The outputs of the forward GRU module, $\overrightarrow{h_t}$, and the backward GRU module, $\overleftarrow{h_t}$, are concatenated to obtain the output of the BiGRU module.
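For reference, a standard textbook formulation of the GRU and BiGRU that matches the description of (3)–(7) is sketched below. The symbols $z_t$, $W_h$, $b_h$, and $H_t$, the concatenation notation $[\cdot,\cdot]$, and the exact placement of $z_t$ versus $1 - z_t$ are assumptions; conventions differ between references, so the paper's own equations may be written differently.

```latex
\begin{align}
r_t &= \sigma\!\left(W_r \cdot [h_{t-1}, x_t] + b_r\right) \tag{3} \\
z_t &= \sigma\!\left(W_z \cdot [h_{t-1}, x_t] + b_z\right) \tag{4} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{5} \\
\tilde{h}_t &= \tanh\!\left(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h\right) \tag{6} \\
H_t &= [\overrightarrow{h_t}\,;\,\overleftarrow{h_t}] \tag{7}
\end{align}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication. When $z_t$ is close to 0 the unit mostly keeps $h_{t-1}$, and when it is close to 1 it mostly adopts the candidate state $\tilde{h}_t$.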
Decoding layer: For labeling medical texts, the sequence labeling task for characters predicts the label for each character. We need to determine the effectiveness and reasonableness of the recognition based on the contextual association of the labels in medical texts. When using deep learning models for NER in medical texts, the output of feature extraction is often combined with a Conditional Random Field (CRF) layer. The CRF layer can make adjustments through its constraint mechanism to avoid errors and constrain the final output by leveraging the interdependence of labels, reducing the error rate and achieving effective label decoding, ultimately obtaining the most probable correctly labeled sequence of sentences.
When the conditional probability distribution $P(Y \mid X)$ of the input observation sequence $X$ can be represented by an undirected graphical model composed of the state sequence $Y$, the distribution of $Y$ with respect to $X$ is regarded as a conditional random field; that is, it satisfies (8). The conditional probability $P(y \mid x)$ is then given by (9), where $t_k$ is the transition feature function, which measures the influence between adjacent state variables, $s_l$ is the state feature function, which mainly measures the influence of the observation sequence on the state variable, $\lambda_k$ and $\mu_l$ are the weights of the corresponding feature functions, and $Z(x)$ is the normalization factor, which ensures that the entire formula is a probability. The calculation of $Z(x)$ satisfies (10).
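For reference, the standard linear-chain CRF formulation that matches the description of (8)–(10) can be written as below. The index names $i$, $k$, $l$, $v$, $w$ and the neighbor notation $w \sim v$ are assumptions taken from the usual textbook presentation, not from the paper itself.

```latex
\begin{align}
P(Y_v \mid X, Y_w, w \neq v) &= P(Y_v \mid X, Y_w, w \sim v) \tag{8} \\
P(y \mid x) &= \frac{1}{Z(x)} \exp\!\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i)
              + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right) \tag{9} \\
Z(x) &= \sum_{y} \exp\!\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i)
              + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right) \tag{10}
\end{align}
```

In (8), $w \sim v$ denotes the nodes adjacent to $v$ in the graph, so each label depends only on the observations and its neighboring labels. The sum in (10) runs over all candidate label sequences, which makes (9) a proper probability distribution.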
For a label sequence, the label at a given time is determined only by its corresponding observation sequence and the labels at adjacent time steps. In practical applications, we first consider the set of all possible label sequences for the observation sequence and finally output the sequence $y^*$ with the largest conditional probability, as shown in (11).
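Enumerating every label sequence is exponential in the sentence length, so the most probable sequence is usually found with the Viterbi dynamic program. The sketch below is a generic illustration, not the paper's implementation: the emission scores, the transition scores, and the label name `MED` are hypothetical. Because the normalization factor is the same for every candidate sequence, maximizing the unnormalized score is equivalent to maximizing the conditional probability.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain model."""
    # best[t][y]: best score of any path that ends in label y at step t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-MED", "I-MED"]
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I-MED"] = -10.0  # penalize the invalid transition O -> I
emissions = [
    {"O": 1.0, "B-MED": 0.9, "I-MED": 0.0},
    {"O": 0.0, "B-MED": 0.0, "I-MED": 1.0},
    {"O": 1.0, "B-MED": 0.0, "I-MED": 0.0},
]
decoded = viterbi(emissions, transitions, labels)
print(decoded)
```

In this toy case, choosing the best label independently at each position would give the invalid sequence O, I-MED, O; the transition penalty steers the decoder to a well-formed entity instead, which is the constraint role the CRF layer plays.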
The pseudo-code for model training is shown in Algorithm 1.
Algorithm 1. MCB-CRF named entity recognition algorithm.
Input: training set, test set, MCBERT model parameter table, number of iterations, batch size
Output: entity extraction model M, precision pre, recall, F1 value
1: Initialize the model with the parameters of the pre-trained model
2: for each iteration do
3:   for each batch do
4:     Get sentence and tag information
5:     Input to the embedding layer to obtain the word vectors
6:     Get the forward GRU output
7:     Get the backward GRU output
8:     Get the bidirectional GRU output
9:     Get the CRF layer output results
10:    Calculate the loss function and update the model parameters
11:  end for
12:  Test accuracy using the test set
13:  if the current model performs best on the test set so far then
14:    Save the model that performs best on the test set
15:  end if
16:  Calculate the evaluation indicators through the model
17:  Calculate the F1 value
18: end for
19: return M, pre, recall, F1
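The evaluation indicators returned by Algorithm 1 can be computed from BIO tag sequences as sketched below. This sketch assumes entity-level exact matching (same span and same label), which is a common convention but is not stated in the text; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, label) spans from a BIO tag sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and tag != f"I-{label}":
            entities.add((start, i - 1, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span matching."""
    gold_set, pred_set = extract_entities(gold), extract_entities(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


gold = ["B-KID", "I-KID", "O", "B-OLD", "I-OLD"]
pred = ["B-KID", "I-KID", "O", "B-OLD", "O"]
metrics = precision_recall_f1(gold, pred)
print(metrics)
```

In the example, the second predicted entity has the right label but the wrong boundary, so it counts as an error for both precision and recall.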
3.3. Chinese Medical Text Relation Extraction Based on FF-PCNN Model
In relation extraction, the text sequence and the existing entity set are typically predetermined. The goal is to determine whether a relationship exists between two entities and, if so, to identify the type of the relationship. This paper leverages deep learning techniques to improve upon existing relation extraction models, proposing a fusion feature relation extraction model, FF-PCNN (Fusion Features, FF; Piecewise Convolutional Neural Network, PCNN). This model effectively utilizes the positional information between entities and the textual information surrounding them, incorporating entity type information into the features to effectively capture the relationships between entities. As shown in Figure 6, the model consists of four parts: a word vector generation layer, position vector embedding layer, PCNN layer, and output layer.
Word vector generation layer: To better capture the meaning of medical texts, the model maps each character to a low-dimensional dense vector through a word vector generation module. This process converts high-dimensional natural language into a form that a computer can process. For relation extraction in medical texts, this paper employs the same word vector generation scheme used for named entity recognition, so that the vectorization is adapted to Chinese medical texts. To extract local features, the text is converted into character-level embeddings using a pre-trained model. The dimension of the result is given by the output dimension of the preceding word embedding layer and the number of vectors.
Position vector embedding layer: In the task of extracting relationships between medical entities, the positional information of characters relative to entities is crucial. The distance between characters and medical entity pairs reflects their relationship to some extent. To address issues arising from the relative position of characters and entities, this section extracts the positional features of characters relative to entities to describe this positional information. For example, in the sentence “安喘片治疗咳嗽” (Anchuan tablets treat cough), the character “治” (treat) has relative distances of 1 and −2 to the head entity “安喘片” (Anchuan tablets) and the tail entity “咳嗽” (cough), respectively. Consider a character at a given position in the sentence. For each entity, the position of its leftmost character and the position of its rightmost character in the text are recorded, together with the class of the entity, so that every entity in the sentence is represented by these three values. The position of a character relative to an entity is then calculated by (12).
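One piecewise rule that reproduces the worked example above is sketched below: the offset is measured to the entity's left edge when the character precedes the entity, to its right edge when the character follows it, and is zero inside the entity. This rule is inferred from the example and may differ in detail from the paper's formula; the function name and zero-based indexing are assumptions.

```python
def relative_position(i: int, left: int, right: int) -> int:
    """Signed offset of character index i from an entity spanning [left, right]."""
    if i < left:
        return i - left    # character precedes the entity: negative offset
    if i > right:
        return i - right   # character follows the entity: positive offset
    return 0               # character lies inside the entity


sentence = "安喘片治疗咳嗽"
head = (0, 2)  # "安喘片"
tail = (5, 6)  # "咳嗽"
i = sentence.index("治")
offsets = (relative_position(i, *head), relative_position(i, *tail))
print(offsets)
```

For “治” at index 3, the sketch yields 1 with respect to the head entity and -2 with respect to the tail entity, in agreement with the distances quoted in the text.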
A lookup table converts the relative positions of the character with respect to the two entities into corresponding vectors, which are then concatenated to obtain the relative position embedding, as given in (13). The final concatenated word vector representation is shown in Figure 7. Finally, the feature representation of the input character is obtained by concatenating the character embedding and the position embedding, as shown in (14).
Feature extraction layer: After the previous processing steps, the character and positional vector features of the medical text are available. To better learn these two features, this section employs the PCNN model to extract the position features of entities and the contextual features around the entities. By taking the character relative position features concatenated with the character vectors as inputs, the PCNN can deeply encode and extract the contextual information of the medical text and the relative positional information between entities. This approach enhances the accuracy and efficiency of the relationship extraction task in medical texts. The calculation method for the aforementioned input features is shown in (15).
The PCNN model slices the output of the convolution filter of a sentence containing two entities based on the positions of the entities and then performs max-pooling on these slices. This approach effectively captures the semantic information of the sentence, improving the performance of the relationship extraction model. The schematic diagram of the model is shown in Figure 8. Assume that the sentence contains two entities whose relation is to be extracted. As defined above, each entity is described by the position of its first character and the position of its last character in the text. The output feature vector of the PCNN can then be calculated from (16) to (19).
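The piecewise max-pooling step can be sketched as follows. This sketch follows the usual three-segment PCNN layout, splitting each filter's output after the end of each entity; the exact segment boundaries used in (16)–(19) are not recoverable from the text, so the boundaries, function name, and toy values here are assumptions.

```python
from typing import List


def piecewise_max_pool(feature_map: List[List[float]],
                       head_end: int, tail_end: int) -> List[float]:
    """Max-pool each filter's output over three segments split at the entities.

    feature_map[f][i] is the response of filter f at character position i.
    Segments: [0, head_end], (head_end, tail_end], (tail_end, end of sentence).
    """
    bounds = [(0, head_end + 1), (head_end + 1, tail_end + 1), (tail_end + 1, None)]
    pooled = []
    for row in feature_map:
        for lo, hi in bounds:
            segment = row[lo:hi]
            # An empty segment (entity at the sentence edge) contributes 0.0.
            pooled.append(max(segment) if segment else 0.0)
    return pooled


# Two hypothetical filters over an eight-character sentence.
feature_map = [
    [0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.8, 0.6],
    [1.0, 0.0, 0.0, 0.2, 0.6, 0.1, 0.3, 0.7],
]
pooled = piecewise_max_pool(feature_map, head_end=2, tail_end=6)
print(pooled)
```

Compared with a single max over the whole sentence, keeping one maximum per segment preserves coarse information about where a strong feature occurs relative to the two entities.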
The pseudo-code for feature extraction using PCNN is shown in Algorithm 2.
Algorithm 2. PCNN feature extraction algorithm.
Input: number of training sessions, batch size, character vector concatenated with the position vector x
Output: output vector y after piecewise convolution
1: for each training session do
2:   for each batch do
3:     Transpose x to the format of the convolutional layer input
4:     Perform a one-dimensional convolution on x
5:     for each segment do
6:       maxpool: perform maximum pooling on the features in the segment
7:     end for
8:     Concatenate the features to obtain the output vector y
9:   end for
10: end for
11: return y
Output layer: The overall algorithm process of the relationship extraction model is shown in Algorithm 3.
Algorithm 3. FF-PCNN relation extraction model algorithm.
Input: training set, test set, MCBERT model parameter table, iterations k, batch size L
Output: relation extraction model R, precision pre, recall, F1 value
1: Initialize the model with the parameters of the pre-trained model
2: for each iteration up to k do
3:   for each batch do
4:     Get tags and entity information
5:     Input to the embedding layer to obtain the word vectors
6:     Get the relative entity position information
7:     Concatenate to get the joint feature information
8:     Obtain the features through piecewise convolution
9:     Generate the entity vector
10:    Concatenate to get the joint feature vector
11:    Predict the relation class using fully connected layers
12:    Calculate the loss and update the model parameters
13:  end for
14:  Test accuracy using the test set
15:  if test_loss < test_loss_min then
16:    Save the best performing model on the validation set
17:  end if
18:  Calculate the evaluation indicators through the model
19:  Calculate the F1 value
20: end for
21: return R, pre, recall, F1
The output of the PCNN and the entity feature information are concatenated and fed into the output layer as a new vector. The information extracted by the feature extraction layer includes character features and the positions between entities, while the entity features are obtained from the entity types. The entity type features and the features of the feature extraction layer are concatenated according to (20) to obtain the final input vector of the output layer, and the indexes are obtained by mapping the subject entity and the object entity in the word vector generation layer, respectively.
After merging the features output by PCNN with the entity category features, the features are further learned through the fully connected layer to obtain the output. Finally, this output is mapped from a vector to probabilities by applying a Softmax classifier, and the relation category with the maximum probability value is selected as the output. The fully connected layer has a parameter matrix and a bias parameter to be learned, and the classifier yields the predicted probability of each relation type for the set of sentences processed by the model. The calculation method is shown in (4)–(11) and (21).
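As an illustration of the output layer described above, the following sketch concatenates pooled PCNN features with subject and object entity type vectors, applies one fully connected layer, and converts the scores to probabilities with Softmax. All names (`output_layer`, `weights`, `bias`), the vector sizes, and the numbers are assumptions chosen for the example; they do not reproduce the paper's Equations (20) and (21) symbol for symbol.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(pcnn_features, subj_type_vec, obj_type_vec, weights, bias):
    """Concatenate features, apply a fully connected layer, then softmax."""
    # Joint feature vector: PCNN output plus both entity type features.
    joint = pcnn_features + subj_type_vec + obj_type_vec
    logits = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    # The relation with the highest probability is the prediction.
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Hypothetical sizes: 2 PCNN features, 2-dimensional type vectors, 3 relations.
pcnn_features = [0.7, 0.9]
subj_type_vec = [1.0, 0.0]
obj_type_vec = [0.0, 1.0]
weights = [[0.0] * 6, [1.0] * 6, [-1.0] * 6]
bias = [0.0, 0.0, 0.0]

probs, best = output_layer(pcnn_features, subj_type_vec, obj_type_vec, weights, bias)
print(probs, best)
```

In a trained model the weight matrix and bias would be learned together with the rest of the network; here they are fixed by hand only to make the data flow visible.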
The FF-PCNN model proposed in this paper has demonstrated good performance in extracting relationships from Chinese medical texts. However, the interpretability of such models is relatively low. This is due to the FF-PCNN model’s multiple neural network layers, including the MCBERT embedding layer, positional embedding layer, PCNN feature extraction layer, and output layer, each containing numerous parameters and complex computations. Consequently, understanding the internal workings of the model is challenging. The model integrates character features, positional features, and entity type features to generate the final input vector. This feature fusion approach is quite complex, making it difficult to analyze which features significantly impact the model’s final output. Techniques like Local Interpretable Model-agnostic Explanations (LIME) [32] or SHapley Additive exPlanations (SHAP) [33] can be used to analyze which features have a significant impact on the model’s predictions. This helps in understanding how the model identifies relationships between entities. Additionally, visualization techniques such as heatmaps or attention mechanism maps can be employed to show which regions or features the model focuses on, aiding in understanding how the model interprets medical texts and performs relationship extraction.
4. Experimental Results
The experimental platform configuration of this article is shown in Table 2.
The specific evaluation indicators used in this experiment and their definitions and calculation methods are as follows:
Precision (P) is the ratio of the number of samples correctly predicted as positive by the model to the total number of samples predicted as positive. This metric reflects the accuracy of the model’s predictions. The calculation is shown in (23), where TP represents the number of samples that are predicted as positive by the model and are actually positive, and FP represents the number of samples that are predicted as positive by the model but are actually negative.

$$P = \frac{TP}{TP + FP} \tag{23}$$

Recall (R) is the proportion of actual positive samples that are correctly predicted as positive by the model. This metric indicates the model’s ability to identify true positives. The calculation is given in (24), where TP is defined as in (23), and FN represents the number of samples that are actually positive but are predicted as negative by the model.

$$R = \frac{TP}{TP + FN} \tag{24}$$

The F1 score is the harmonic mean of P and R, which is calculated as shown in (25).

$$F1 = \frac{2 \times P \times R}{P + R} \tag{25}$$
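The three indicators can be computed directly from the counts TP, FP, and FN. The sketch below is a minimal illustration; the function name `precision_recall_f1` and the example counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Zero denominators yield 0.0 so the function never divides by zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, so a model cannot reach a high F1 by maximizing precision or recall alone.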
4.1. Experimental Verification of Medical Entity Extraction of MCBERT-BiGRU-CRF Model
The NER experiment in this study involves several steps: first, collecting the required data from the internet, followed by data cleaning, filtering, and labeling as part of the preprocessing. Next, the preprocessed data are used to fine-tune a pre-trained model. Subsequently, the semantic features of the sentence context are extracted using a bidirectional GRU layer, and the Softmax function is employed to assign the most probable label to each character. Following this, a CRF layer is used for decoding to produce the optimal label sequence for the input. Additionally, the model’s performance is evaluated using a test dataset, and relevant performance metrics are employed to assess the model’s effectiveness.
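The CRF decoding step mentioned above is commonly implemented with the Viterbi algorithm, which finds the label sequence with the highest total score. The following sketch assumes per-character emission scores (for example, from the BiGRU layer) and a label transition score matrix; the label set, the scores, and the name `viterbi_decode` are hypothetical and are not taken from the paper's code.

```python
def viterbi_decode(emissions, transitions):
    """Return the best label path and its score.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j in this position.
            best_prev = max(
                range(n_labels), key=lambda i: scores[i] + transitions[i][j]
            )
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(n_labels), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(scores)


labels = ["O", "B", "I"]  # hypothetical tag set
emissions = [
    [1.0, 2.0, 0.0],
    [0.5, 0.0, 1.5],
    [2.0, 0.0, 0.5],
]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I is strongly penalized
    [0.0, -1.0, 1.0],
    [0.0, 0.0, 0.5],
]
path, score = viterbi_decode(emissions, transitions)
decoded = [labels[i] for i in path]
print(decoded, score)
```

The transition scores are what allow the CRF layer to rule out invalid sequences, such as an inside tag that directly follows an outside tag, which a per-character Softmax cannot do on its own.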
The optimal experimental settings are determined by testing various dropout rates and learning rates. The model performs best when the dropout rate is set to 20% and the learning rate is set to 0.005. A comparison of different dropout rates is shown in Table 3.
Table 4 shows the recognition performance of the MCB-CRF model on different entity types, using the F1 index as the evaluation criterion. The results show that the model achieves good results on most entity types.
To evaluate the effectiveness of the proposed named entity recognition (NER) model on Chinese medical texts, this section selects several methods that have shown excellent performance in the NER field in recent years for comparative testing. The initial hyperparameters of the models are uniformly set, and all training is conducted using the same dataset. The specific models are as follows:
MCBERT-CRF [26]: The feature vector X of the Chinese medical text is directly passed to a linear layer to reduce its dimension to the level of named entity labels, and then CRF is applied to perform the sequence labeling task.
NER With Dic [27]: A Chinese named entity recognition technique that introduces dictionary information. This method avoids building a complex sequence processing framework by integrating word dictionaries into character-level representations. The model is characterized by its simple structure, fast inference, and cross-platform capability. At the same time, the added vocabulary information helps the model determine the boundaries of named entities more accurately, which improves the recognition accuracy.
BERT-BiGRU-CRF [28]: The classic BERT model is used to construct character-level vectors, which are then fed into a bidirectional GRU for in-depth feature extraction and finally passed through a CRF layer for accurate sequence labeling.
The results obtained after training are shown in Table 5.
By analyzing the above experiments, the following conclusions can be drawn:
Comparing the results of Experiments 1 and 4 reveals that the inclusion of the BiGRU component significantly improves the model’s precision, recall, and F1 score. This indicates that the BiGRU component effectively enhances the model’s feature extraction capability, enabling a more accurate identification of entities related to the features. On the relatively small dataset used in this study, relying solely on MCBERT for feature extraction is not ideal. This is likely because the BERT model is quite large and typically requires a substantial dataset to fine-tune the parameters for optimal performance. By integrating the BiGRU model, the feature extraction capability of the model is further enhanced beyond what BERT alone can achieve.
Comparing the results of Experiments 2 and 4, it is evident that replacing the word vector encoding component with a model pre-trained on medical and biological corpora significantly improves the model’s recall rate. In Experiment 2, the granularity of words in the dictionary is relatively fine, resulting in less accurate recognition of medical texts containing entities with larger granularity. However, the named entity recognition model using MCBERT for word vector encoding can identify more named entities. Overall, MCBERT outperforms the Word2vec model in word vector encoding, indicating that the pre-trained vector model used in this study has superior capability in representing word feature information.
Comparing the results of Experiments 3 and 4 reveals that the pre-trained model focused on the biomedical domain performs slightly better than the model based on a broad domain. This is because the MCBERT model incorporates specialized knowledge from the medical field, enhancing the accuracy of the named entity recognition task. The experimental results confirm that the proposed named entity recognition model outperforms the comparison models in processing Chinese medical texts, further validating the model’s effectiveness.
In the Chinese medical domain, the number of papers related to knowledge graphs is relatively small. Sun [34] completed entity recognition and relation extraction on Chinese medical texts. It should be noted that the datasets used to train the models are different, so some experimental results may not be directly comparable. In Sun’s NER model, each element in the sequence is converted from a character representation to a vector representation in the embedding layer through a BERT word vector lookup table and dynamically constructed medical entity feature vectors. The sequence vector representation is then input into the encoder, which utilizes a BiGRU incorporating lexical information to automatically extract features and output hidden vectors. Finally, CRF is used as a decoder to predict the entity positions and entity type labels corresponding to each character in the sequence. According to Sun’s experimental results, the accuracy is 74.18% and the recall is 84.24%.
The comparison reveals that while the two models follow a similar general approach, Sun’s model uses BERT for embedding, whereas our model employs a pre-trained bi-directional encoding method using transformers specifically adapted to a Chinese medical corpus. This adaptation results in higher efficiency and better performance.
4.2. Experimental Verification of Chinese Medical Text Relation Extraction Using the FF-PCNN Model
Given a sentence together with its entity set, its triple set, and the set of entity category pairs between which a relationship is possible, the final training data are obtained by embedding identifiers into the sentence. The steps are given by (26), and the calculation of the final relationship is shown in (27). These expressions involve the subject entity and object entity of the sentence, the categories of each entity, and the number of samples to be processed in the dataset; the product operator denotes the direct product between sets. The subsequent experiments in this section are conducted using the processed dataset.
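The construction of training samples described above can be sketched as follows: candidate subject and object pairs are drawn from the direct product of the entity set with itself, filtered by the allowed entity category pairs, and identifiers are then embedded around both entities. The marker strings `<S>`, `</S>`, `<O>`, `</O>`, the function name `build_candidates`, and the English example sentence are assumptions made for illustration; the paper's actual identifiers are not shown in this section.

```python
from itertools import product


def build_candidates(sentence, entities, allowed_pairs):
    """Build marked sentences for every admissible (subject, object) pair.

    entities: list of (start, end, category) with end exclusive.
    allowed_pairs: set of (subject_category, object_category) tuples.
    """
    samples = []
    for i, j in product(range(len(entities)), repeat=2):
        if i == j:
            continue
        subj, obj = entities[i], entities[j]
        if (subj[2], obj[2]) not in allowed_pairs:
            continue
        # (position, is_opening, tag); sorting in reverse inserts from right
        # to left so that earlier offsets remain valid.
        inserts = [
            (subj[0], 1, "<S>"), (subj[1], 0, "</S>"),
            (obj[0], 1, "<O>"), (obj[1], 0, "</O>"),
        ]
        marked = sentence
        for pos, _, tag in sorted(inserts, reverse=True):
            marked = marked[:pos] + tag + marked[pos:]
        samples.append((marked, subj[2], obj[2]))
    return samples


sentence = "aspirin treats headache"
entities = [(0, 7, "medication"), (15, 23, "disease")]
allowed_pairs = {("medication", "disease")}
samples = build_candidates(sentence, entities, allowed_pairs)
print(samples)
```

Filtering by category pair before classification keeps the relation model from scoring pairs, such as disease to disease, for which no relation type is defined in the schema.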
The training parameters of the model in the experiment are shown in Table 6.
Table 7 presents the recognition results of FF-PCNN for various entity relations. According to the experimental results, the proposed model demonstrates good performance in identifying most entity relationships. For the entity types “medication” and “disease”, the relationships may fall into categories such as “treatment” and “adverse reaction”, which can lead to some misclassifications during the recognition process. As a result, the overall accuracy is slightly lower. However, the relationship recognition performance is generally good.
Next, the effectiveness of the proposed relation extraction model is evaluated. This section compares the proposed model with representative deep learning-based distant supervision relation extraction methods through comparative experiments:
Unire (Wang et al., 2021) [29] replaces the two separate label spaces conventionally used for entity detection and relation classification with a unified label space, which facilitates interaction between the two tasks. Additionally, it designs an approximate joint decoding algorithm to output the final extracted entities and relations.
PURE (Zhong and Chen, 2021) [30] introduces a simple pipeline extraction model that integrates entity information early in the relation model and consolidates global context representations. The relation model takes only the output of the entity model as input, accepting a slight reduction in accuracy in exchange for improved training efficiency.
By replicating the above comparison algorithms using the dataset in this study, the experimental results are shown in Table 8 and Figure 9. It can be observed that the proposed model demonstrates better extraction performance compared to the benchmark models. Compared to Unire, the FF-PCNN model has a simpler structure and achieves better feature extraction performance. This indicates the effectiveness of the proposed feature extraction approach for entity types and relative positions. Compared to PURE, the proposed model’s F1 score exceeds that of PURE by 1.6%. Upon analysis, while PURE also utilizes entity type information for feature concatenation, it merely concatenates entity types and character-level vector embeddings before classification. In contrast, the proposed model not only incorporates entity type information but also integrates contextual information and relative position information between the subject and object entities. This allows for a better extraction of hidden semantic information in the text, enhancing the accuracy of the relation extraction task. These results demonstrate the effectiveness of the proposed model’s design.
Sun employs a joint extraction method for relation extraction, constructing specific sentence representations for each medical relation based on an attention mechanism. This method calculates attention scores for each character under different medical relations, using the entity recognition model proposed by Sun to directly output the entity relations contained in the text. According to Sun’s results, the accuracy is 91.6% and the recall is 89.3%.
The comparison shows that the relation extraction methods used in this paper and Sun’s paper are fundamentally different. The models for named entity recognition and relation extraction in this paper are independent of each other, making the system easier to implement and more flexible while maintaining a high level of accuracy.
4.3. Rational Medicine Use System Demo
This system can conduct dosage reviews by examining the medication methods for specific populations. It assesses patient information to determine the category to which the patient belongs. Then, it compares the prescribed dosage and usage information with the data in the knowledge base for that medication to judge whether the prescription is appropriate for the patient. After the physician reviews and approves the prescription information, the corresponding prescription details are updated in the patient’s prescription information interface. By clicking the prescription information button in the patient information interface, one can view the prescription details and the corresponding disease knowledge graph. In this graph, the color of the recommended medication nodes, ranging from green to black, indicates the level of recommendation, with green being recommended and black not recommended. The prescription information page is shown in Figure 10.
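The dosage review flow described above (classify the patient, look up the rule for that population, compare the prescription against it) can be sketched as follows. Everything in this example is hypothetical: the population categories, the age thresholds, the placeholder medication `drug_a`, and the dose limits are invented for illustration, carry no clinical meaning, and do not reflect the system's actual knowledge base or rules.

```python
def patient_category(age_years, pregnant=False):
    """Assign a population category (hypothetical thresholds)."""
    if pregnant:
        return "pregnant"
    if age_years < 18:
        return "child"
    if age_years >= 65:
        return "elderly"
    return "adult"


# Hypothetical knowledge base keyed by (medication, population category).
KNOWLEDGE_BASE = {
    ("drug_a", "adult"): {"max_single_mg": 500, "max_times_per_day": 3},
    ("drug_a", "child"): {"max_single_mg": 250, "max_times_per_day": 2},
}


def review_prescription(kb, medication, age_years, single_mg, times_per_day,
                        pregnant=False):
    """Return the patient category and a list of detected issues."""
    category = patient_category(age_years, pregnant)
    rule = kb.get((medication, category))
    if rule is None:
        # No rule for this population: defer to manual review.
        return category, ["no dosage rule for this population"]
    issues = []
    if single_mg > rule["max_single_mg"]:
        issues.append("single dose exceeds limit")
    if times_per_day > rule["max_times_per_day"]:
        issues.append("frequency exceeds limit")
    return category, issues


child_result = review_prescription(KNOWLEDGE_BASE, "drug_a", 8, 300, 2)
adult_result = review_prescription(KNOWLEDGE_BASE, "drug_a", 30, 500, 3)
elderly_result = review_prescription(KNOWLEDGE_BASE, "drug_a", 70, 250, 1)
print(child_result, adult_result, elderly_result)
```

An empty issue list corresponds to a prescription that passes the automatic check; in the system described here, the physician's review and approval still follow before the prescription details are updated.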
5. Conclusions
In recent years, with the continuous development and promotion of intelligent medical systems, the integration of auxiliary medication and computer technology has become increasingly important in addressing the needs and practical scenarios of medication usage. The research on knowledge graphs in the medical domain has thus gained significant importance. This study delves into the construction of a medical knowledge graph using natural language processing and deep learning technologies. Through the design of ontologies and the study of knowledge extraction algorithms in the medical field, a Rational Medicine Use System based on a domain-specific knowledge graph has been constructed.
To address issues such as the misrecognition of medical-specific terms and long training times in existing pipeline extraction processes, this study avoids the traditional word segmentation and embedding method. Instead, it employs a character-level embedding approach for word vector representation. This approach effectively recognizes medical entities. To address the challenges of comprehensively utilizing contextual semantic information, entity type information, and relative position information in medical texts, this study adopts a model that integrates the positional information of entities and contextual information to form a combined vector. This combined vector is further fused with entity type information to output relationship probabilities. This approach predicts relationships between entities in Chinese medical texts, providing technical support for the extraction of entity relationships in medical knowledge graphs.
In future research, several directions can be pursued to enhance and expand the current study. In this study, a pipeline approach was used for information extraction in the medical domain. Although this method shows improvement over previous models, it still suffers from the drawback of error accumulation. The experiments for named entity recognition and relation extraction were conducted independently, yet the performance of NER can influence relation extraction during the knowledge extraction process. With the growing research on joint extraction methods, future work could explore the application of joint extraction of entities and relations in the medical field for Chinese texts. This approach could potentially mitigate the error accumulation issue and enhance the overall extraction accuracy.