1. Introduction
Knowledge graphs in the medical area have become enormously relevant to support medical research development and to facilitate the exchange of clinical data and scientific advances. That is the case with the well-known NCBO BioPortal, a medical ontology repository where multiple health institutions publish their ontologies or knowledge graphs. In this article, we describe a modularization-based method for the reutilization of portions of medical knowledge graphs into a general model, maintaining reasoning efficiency.
The modularization of knowledge graphs has been widely studied from the perspective of ontology engineering. We therefore consider the work developed for the modularization of ontologies to be applicable to knowledge graphs. First, however, it is important to clarify the definitions of ontology and knowledge graph.
The term ontology in computing can be traced back to the 1990s, when authors such as Gruber [1], Guarino [2], and Uschold [3] presented formal definitions and discussed possible interpretations. Taking these definitions as a reference, in this paper we consider that an ontology consists of the following elements and one requirement:
- (a) A set of concepts or classes, which may be referred to as a vocabulary of terms or conceptualization.
- (b) A set of semantic relations between concepts, which may be taxonomic or hierarchical relationships, and general (non-taxonomic) relationships between concepts.
- (c) A set of individuals or instances.
The set of elements mentioned above must satisfy the requirement of formality, that is, they must be defined by logical axioms in such a way that misinterpretations of the definitions are avoided.
On the other hand, the concept of the knowledge graph emerged more recently and was popularized by Google; since 2012, it has gained great relevance owing to the large number of publicly available knowledge graphs and the ease of accessing them through query languages such as SPARQL or through REST APIs.
However, the formal concept of a knowledge graph is still vague; for example, Lisa and Wolfran [4] present an analysis of various definitions. Heist et al. [5] reviewed a few definitions and stated that the following interpretations are possible. Knowledge graphs:
- (a) describe real-world entities and their relationships, organized as graphs;
- (b) define classes and relations of entities in a schema;
- (c) allow any entity to be related to other entities;
- (d) cover various domains of knowledge or topics.
Taking the mentioned references into consideration, we observe a generalized notion that the term knowledge graph refers to real data instances (in fact, a knowledge graph can comprise a large amount of data or resources), while ontologies have a high formalization requirement. Another difference is the type of representation or syntax used: ontologies mainly use a description logic-based syntax, while knowledge graphs use a triple-based data model such as RDF, serialized for example in Turtle. In this article, we take as a reference the related work on modularization applied to ontologies and consider it relevant for proposing a method of reusing knowledge graphs through modules. Although knowledge graphs and ontologies differ, in practice it is possible to transform an ontology based on description logic syntax into a knowledge graph. In fact, the knowledge graphs used in this article are referred to as biomedical ontologies; however, we use their triple-based representation, specifically RDF.
Reutilization of knowledge graphs occurs during the design and integration of multiple semantic representation models (ontologies or knowledge graphs) with the aim of generating a complete representation model to solve the requirements of intelligent information systems or expert systems. Reutilization of a knowledge graph or ontology is the task of completely or partially importing one model into another. The reuse of a knowledge graph may involve carrying out a process of modularization and adaptation or mapping from one model to another.
The process of reusing a knowledge graph begins with the search and selection of the knowledge graphs or ontologies that are considered appropriate to complete the definitions required by the system. Once the knowledge graphs to be reused have been identified, it is necessary to review in detail how this integration will be carried out. To address this problem, the following questions should be considered:
Is it necessary to import the entire model or is only a part of the model required?
Is it necessary to adjust or make adaptations between the imported model and the host model?
What is the most suitable modularization approach?
What is the computational cost of each option?
This article describes a general method for generating and reusing knowledge graph modules. This method arises from the need to integrate a general knowledge graph for managing electronic patient records. During the process of designing and integrating the general knowledge graph, a great disadvantage was observed in reusing some biomedical ontologies. This is mainly because most of these ontologies are very large and complex. Therefore, it was necessary to develop a method that would allow obtaining useful information for the general graph by extracting modules.
As a result, four modules of knowledge graphs were obtained, which were imported into the general knowledge graph. The resulting knowledge graph functioned adequately for the information needs required for the project.
The rest of the paper is organized as follows: in Section 2, a review on the modularization of knowledge graphs is presented; Section 3 presents the method to generate and reuse medical knowledge graph modules; in Section 4, the application of the method to generate the Medicament knowledge graph module is presented; in Section 5, the application of the method to generate the Disease knowledge graph module is described; in Section 6, the integrated knowledge graph modularization system is presented; in Section 7, two evaluation approaches are described; and finally, in Section 8, conclusions are presented.
2. Modularization of Knowledge Graphs
One of the first attempts to address the modularization of representation models was reported by Alan Rector in 2003 [6], who stated the importance of modularity as a key requirement for ontologies to achieve reutilization, maintainability, and evolution. Modularization of ontologies was defined as the task of decomposing an ontology into independent disjoint skeleton taxonomies.
In 2006, Schlicht and Stuckenschmidt [7] described a modularization approach aimed at supporting the distribution of knowledge in a P2P network. The authors described the requirements of reasoning efficiency, robustness, and maintainability, and proposed the following structural criteria: connectedness of modules (the efficiency of distributed computation is closely related to the degree of interconnectedness of the generated modules); size and number of modules (this criterion has a strong impact on robustness); and redundancy of representation (the use of redundant representations improves robustness at the cost of increased maintenance). The goal of this approach is to split a given ontology by considering numerical aspects such as size, number of modules, and module interconnectedness.
From a logic-based perspective, a series of contributions has been reported by a research group led by Bernardo Cuenca Grau and Ian Horrocks. In 2006 [8], Cuenca Grau et al. described the notion of a module from a model-theoretic perspective as a self-contained unit within the ontology. They stressed additional module requirements such as the scope of the module, its size, and its correct interpretation. Later, in 2007 [9], Cuenca Grau et al. presented an approach to extracting modular fragments from ontologies, preserving the minimal conditions. They defined a module with respect to a signature S; to reduce the cost of importing external ontologies, they proposed importing only a fragment of the given ontologies while preserving the meaning of the imported terms.
Based on the aforementioned references, a module can be defined as a self-contained knowledge unit within the knowledge graph that has a clearly defined scope, a clearly defined size, and a correct interpretation.
Modularization, in turn, can be defined as the task of decomposing a knowledge graph into independent disjoint modules, splitting it according to module size, number of modules, and module interconnectedness.
Accordingly, knowledge graph modularization should address the following criteria:
- (a) Facilitate reutilization and the evolution of knowledge graphs.
- (b) Maintain reasoning efficiency.
- (c) Keep the balance between robustness and maintainability.
- (d) Enable distribution of knowledge in open networks.
- (e) Support connectedness of modules.
The concepts of module and modularization are closely related to the process of reusing knowledge graphs, because the purpose is to extract modules from knowledge graphs so that they can be imported into another model, ensuring that the semantic definitions and relations are not lost while maintaining the efficiency of reasoning. Therefore, in the rest of this article we will use the concept of the Knowledge Graph Module (KGM) as the resulting product of the modularization process.
In 2009, Pathak et al. [10] presented a survey on modular ontology techniques that are based on logical formalisms and graph theories. They state that modular ontology techniques are crucial for the biomedical domain, because most popular ontologies are large and complex. Therefore, the development of tools for managing multiple distributed ontologies will benefit reasoning performance. The authors establish that an ontology module should be self-contained and logically consistent. They also defined the goals of ontology modularization: partial reuse, complexity, ownership and customization, efficient reasoning, and tooling support. Finally, the authors outlined evaluation criteria and requirements: localized semantics, correct reasoning, transitivity, safe reuse, and decidability.
Courtot et al. [11] presented the Minimum Information to Reference an External Ontology Term (MIREOT) guidelines to aid the development of the Ontology for Biomedical Investigations (OBI). The purpose of these guidelines is to import the minimum required reference from an external ontology; for this, the authors propose three cases, depending on what is required as a reference. All of them generally require that the identifier or URI (currently IRI) of the referenced class or term be included. Like MIREOT, the method described in this article keeps the unique identifier of each concept or class, but it also includes more relevant details of each concept while avoiding overloading the memory of the reasoner. The main disadvantage of modularization approaches such as MIREOT is that, to maintain logical consistency, they seek to import all references to dependent or semantically related classes. In the method that we present in this article, logical consistency is guaranteed by the transformation process used, which depends on a very simple design of classes and semantic relations, and reuse refers only to concepts.
One of the main tradeoffs of ontology modularization is determining the level of expressiveness that needs to be maintained from the original ontology: the higher the level of expressiveness, the lower the ability to reason efficiently with large ontologies. Regarding this problem, Algergawy et al. [12] introduced the Ontology Analysis and Partitioning Tool (OAPT), a framework for analyzing and partitioning ontologies. The main approach used by the authors was to partition an ontology into modules and then apply evaluation criteria. In contrast, in this article we focus on the extraction of specific information or knowledge that needs to be included in another knowledge graph for a particular purpose; the original graphs are not partitioned into modules.
In 2012, Rector et al. [13] presented use cases for the modular development of ontologies by using the import mechanism of OWL. According to the authors, ontology modularization concerns two topics: module extraction, which consists of separating existing ontologies into modules; and modular development, which resembles modular software engineering. In this article we present a modularization method that combines aspects of both cases; on the one hand, the general design of the knowledge graph is carried out with a modular development approach from the beginning, and on the other hand, we present a semi-automatic method to perform the extraction of modules for reuse.
Slater et al. [14] described an algorithm to evaluate OBO (Open Biological and Biomedical Ontologies) ontologies. As a result, they identified a prevalence of unsatisfiable class axioms and found that these axioms belong to the BFO (Basic Formal Ontology), which is the upper ontology that most OBO ontologies adhere to. The authors concluded that the BFO is not used consistently among all the OBO ontologies that implement it. Taking this into consideration, our modularization and reuse approach completely avoids importing the entire ontologies and the references they include from the BFO. Consequently, reusing the knowledge graph modules produced by our method does not introduce these inconsistencies.
In 2021, Shimizu, Hammar, and Hitzler [15] described the reasons why ontologies are not frequently reused as follows: differing representational granularity, lack of conceptual clarity in many ontologies, lack and difficulty of adherence to good modeling principles, and lack of support in ontology engineering tools.
Another relevant issue addressed by the modularization method described in this paper relates to the way biomedical ontologies are developed: most of them do not define individuals or instances at the assertional level (A-Box) and are mainly designed at the terminological level (T-Box). This implies that every defined concept is represented as a class in a class hierarchy, but no individual is defined as a member of any class. This hinders their reuse in the target knowledge graph: if an is_A relationship is established between an object and a class, the interpretation is that the object is a member of that class, which is misleading.
According to Zubeida and Maria [16], there are three different approaches to ontology modularization: graph partitioning, logic-based approaches for locality, and queries. They also mention the manual creation of modules as a solution to specific needs. In that work, the authors describe an abstraction-based modularization method. Our modularization method is similar to their abstraction algorithm in the sense that the representation model is transformed from classes to individuals, which can be understood as an abstraction of the representation. However, this particular type of abstraction has not been addressed or described previously.
Although there are various approaches and studies presented for the reuse of ontologies or knowledge graphs, there are still difficulties for this practice to become a reality. In this paper, we present a module extraction approach that depends on a careful design of the concepts and relationships that are required to be reused. The main characteristics and benefits of the method are:
- (a) Logical consistency is guaranteed by the transformation process used, which relies on the manual design of classes and semantic relations. Once the base definitions are created, the instantiation of individuals is automatic.
- (b) The method focuses on the extraction of specific information or knowledge that needs to be included in another knowledge graph for a particular purpose. The original graphs are not partitioned into modules.
- (c) The modularization method completely avoids importing the entire ontologies and the references they include from the BFO.
- (d) The representation model is transformed from classes to individuals.
- (e) The reported method is repeatable, so it is possible to build a tool that assists the designer or integrator in the reuse process.
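The class-to-individual transformation mentioned in item (d) can be illustrated with a minimal sketch. It is not the implementation used in this work: the namespace `http://example.org/kgm#`, the property `sourceIRI`, and the class and method names are hypothetical and serve only to show how a source class could become an individual of a simple target class while keeping its original identifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: turn a source class (T-Box) into an individual (A-Box) of a simple target class. */
final class ClassToIndividualSketch {

    // Hypothetical target namespace; the real module namespace is not given in the text.
    static final String KGM_NS = "http://example.org/kgm#";

    /**
     * Builds three triples (full IRIs, valid Turtle) that declare one individual of
     * targetClass, keep the source class IRI as a reference, and copy the label.
     */
    static List<String> toIndividual(String sourceClassIri, String label, String targetClass) {
        // Reuse the local name of the source IRI so the original identifier is preserved.
        int cut = Math.max(sourceClassIri.lastIndexOf('#'), sourceClassIri.lastIndexOf('/'));
        String subject = "<" + KGM_NS + sourceClassIri.substring(cut + 1) + ">";

        List<String> triples = new ArrayList<>();
        triples.add(subject + " <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <"
                + KGM_NS + targetClass + "> .");
        // "sourceIRI" is a hypothetical property that points back to the original class.
        triples.add(subject + " <" + KGM_NS + "sourceIRI> <" + sourceClassIri + "> .");
        triples.add(subject + " <http://www.w3.org/2000/01/rdf-schema#label> \""
                + escape(label) + "\" .");
        return triples;
    }

    /** Escapes only backslashes and double quotes for a string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Because the individual is typed with a single, manually designed class and carries only a few data values, no axioms from the source ontology or from its imports are pulled into the target graph.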
3. Method to Generate and Reuse Medical KGM
To describe the analysis and decision-making process that was carried out, this section first presents the system that requires a method for reusing knowledge graphs. Specifically, the analysis of information requirements of the system requires the search, selection, and reuse of various external knowledge graphs. This system is based on the specification of a comprehensive knowledge graph to support the representation and management of the Electronic Health Records (EHR) of patients. We start from a base model that was designed with the aim of representing patient profiles. This base model was first reported in [17].
Figure 1 shows the general EHR model, which depicts in highlighted color the concepts that require the reutilization of medical vocabularies: Medicament, Disease, Symptom, and Laboratory.
As a result of the analysis of the conceptual requirements of the initial representation model, it was necessary to find reliable and scientifically valid sources of information for the inclusion of concepts regarding medical knowledge. Therefore, to complete this model, we decided to reuse vocabularies and definitions from external references. A series of searches were executed in BioPortal, resulting in the list of ontologies shown in Table 1.
Various methods exist for reusing each of these knowledge graphs; for instance, importing full models into a general knowledge graph. However, importing a complete model is not a good alternative, because all the axioms of every imported model are added to the integrated representation model, making it intractable in terms of the memory resources required to use it. Therefore, a different approach had to be defined and implemented.
We considered using a modularization-based method for the reutilization of knowledge graphs; this method consists of the transformation of the part of the knowledge graph definitions that is relevant for the specific model being integrated. For each of the general concepts highlighted in Figure 1, we defined and implemented the following procedure (method shown in Figure 2) that allowed the modularization and reutilization of knowledge graphs.
Select the knowledge graph or ontology that will be reused. The knowledge graph must be downloaded and analyzed locally.
Identify the relevant data or attributes that will be reused to complete the conceptualization into the knowledge graph being integrated.
Define and implement data structures using the object-oriented programming paradigm to allow the clearest and most efficient handling of the concepts of interest.
Implement a program to query the knowledge graph using SPARQL and automatically obtain the list of concepts of interest. SPARQL is selected because all the knowledge graphs are serialized in RDF syntax, for which it is the standard query language; other languages such as SWRL, which is a rule language rather than a query language, are not suitable for querying RDF serializations.
Define and construct the model of the knowledge graph that will be used to contain the concepts and attributes of interest.
Develop a program to automatically populate the knowledge graph with the list of concepts obtained.
The application of this method for each of the required concepts is described in the following sections.
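The way the programmable steps of the procedure fit together can be sketched as follows. This is a minimal illustration under stated assumptions rather than the system described in this article: the interfaces `ConceptSource` and `ModuleWriter` and the class `ModulePipelineSketch` are hypothetical names standing in for the query program and the population program, respectively.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the query program: returns the concepts of interest as objects. */
interface ConceptSource<T> {
    List<T> fetchConcepts();
}

/** Hypothetical stand-in for the population program: turns one concept into statements. */
interface ModuleWriter<T> {
    List<String> write(T concept);
}

/** Sketch of the extraction-then-population sequence for one knowledge graph module. */
final class ModulePipelineSketch {

    static <T> List<String> buildModule(ConceptSource<T> source, ModuleWriter<T> writer) {
        List<String> statements = new ArrayList<>();
        // Query the source graph once, then populate the module concept by concept.
        for (T concept : source.fetchConcepts()) {
            statements.addAll(writer.write(concept));
        }
        return statements;
    }
}
```

Keeping the source and the writer behind separate interfaces reflects the semi-automatic nature of the method: the manual decisions (which graph, which attributes, which target model) determine the implementations, while the loop itself is the same for every module.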
4. Medicament KGM
For the representation of an EHR, it is necessary to specify the medications that a patient may be taking or that will be indicated as part of their pharmacological treatment. Therefore, it is necessary to use a medical vocabulary of drugs.
To obtain a Medicament KGM, we implemented the method to generate and reuse medical KGM as follows:
4.1. Selection of the Knowledge Graph
For the representation of medicament concepts and relations we selected the NDF-RT knowledge graph. NDF-RT is the National Drug File—Reference Terminology produced by the Veterans Health Administration.
Figure 3 shows a snapshot of the NDF-RT model as visualized in Protégé editor.
As can be seen in Figure 3, the NDF-RT knowledge graph is very complex and is not an intuitive model to understand. This is in part because the medical information it represents is highly specialized and technical. However, the structure of the ontological model is also difficult to understand. For example, class naming is based on a nomenclature established in the design principles indicated by the NCBO for the publication of ontologies in BioPortal. As a result of this naming convention, class names do not provide any clue of which drugs they represent; it is necessary to expand the classes to identify the class that groups the entire hierarchy of drugs. Additionally, it is necessary to explore all the annotations to understand the specific details of each concept.
Another problem observed with this knowledge graph is that all the represented concepts are organized mainly using class definitions; that is, there are no individuals as members of the classes, which makes it difficult to establish relationships between objects as required by the base model shown in Figure 1.
Despite the difficulties that this model presents, NDF-RT is well worth reusing within the EHR because it defines concepts that relate drugs to diseases, for example through the may-treat and may-prevent relationships, which we consider relevant.
4.2. Identification of the Data and Attributes to Be Imported
For review and analysis, the NDF-RT knowledge graph was saved in Turtle syntax. For the extraction of medicament products, we selected only the “Drug Products by VA Class”, which represent clinical pharmaceutical products that can be ordered. As an example, the code shown in Figure 4 describes the concept N0000160630, which identifies the drug product “Metformin hydrochloride 500 MG Oral Tablet”. Based on this code, we can identify the important attributes that should be considered, for instance, the RxNorm_Name, the RxNorm CUI, units, strength, and NUI, among others.
The set of relations or roles that NDF-RT uses can be extracted from a different part of the definitions. The code shown in Figure 5 presents the set of drug relationships with other concepts, for example, mechanism of action, physiological effects, and clinical kinetics, among others.
4.3. Define and Implement Data Structures
To extract and represent medicament information using an object-oriented paradigm, the Java™ classes shown in Figure 6 were implemented. The Medicament class includes the attributes that were selected for their relevance. Note that only a lightweight subset of the concept attributes is extracted. The Property class is used as a helper class in extracting all the important relationships that drugs can have with other concepts. The OntologyUtils class implements all the information extraction methods as well as the ontology population methods. The DrugOntologyCreation class represents the main program in which the data extraction, transformation, and module generation tasks are executed in sequence.
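As an illustration of what such data structures might look like, the following sketch defines two small holder classes. The field names are assumptions derived from the attributes listed in Section 4.2 (NUI, RxNorm name, RxNorm CUI, strength, units); the class names carry a `Sketch` suffix to make clear that they are not the classes of Figure 6.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: one relationship of a drug, given as a role name and a target concept id. */
final class PropertySketch {
    final String role;
    final String targetId;

    PropertySketch(String role, String targetId) {
        this.role = role;
        this.targetId = targetId;
    }
}

/** Hypothetical lightweight view of a drug product; only the selected attributes are kept. */
final class MedicamentSketch {
    final String nui;
    final String rxNormName;
    final String rxNormCui;
    final String strength;
    final String units;
    private final List<PropertySketch> properties = new ArrayList<>();

    MedicamentSketch(String nui, String rxNormName, String rxNormCui,
                     String strength, String units) {
        this.nui = nui;
        this.rxNormName = rxNormName;
        this.rxNormCui = rxNormCui;
        this.strength = strength;
        this.units = units;
    }

    /** Adds one relationship retrieved for this drug. */
    void addProperty(PropertySketch property) {
        properties.add(property);
    }

    /** Read-only view, so callers cannot modify the collected relationships. */
    List<PropertySketch> getProperties() {
        return Collections.unmodifiableList(properties);
    }
}
```

Storing relationships as a list of role and target pairs keeps the drug object independent of how the source graph encodes those relationships.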
4.4. Implement a Program to Query the Knowledge Graph
To obtain the information from the NDF-RT knowledge graph, we implemented a Java™ program using the RDF4J API that issues SPARQL queries, from which a list of objects of the previously defined types is generated. The first query in Table 2 obtains the list of medications and their basic attributes, and the second query obtains the entire set of relationships, all of which are characterized as properties.
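The general shape of two such queries can be sketched as plain query strings. These are not the queries of Table 2. The first assumes that the drug products are subclasses of a root class whose IRI is passed in by the caller. The second assumes that drug roles are encoded as OWL existential restrictions, which is a common encoding but is not confirmed here for NDF-RT. The class and method names are hypothetical.

```java
/** Sketch: build SPARQL query strings that a SPARQL engine such as RDF4J could evaluate. */
final class DrugQuerySketch {

    /** Lists every direct or indirect subclass of the given root class with its label. */
    static String drugListQuery(String rootClassIri) {
        return String.join("\n",
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
                "SELECT ?drug ?label WHERE {",
                "  ?drug rdfs:subClassOf+ <" + rootClassIri + "> ;",
                "        rdfs:label ?label .",
                "}",
                "ORDER BY ?label");
    }

    /** Lists role and target pairs of one drug, assuming roles are OWL restrictions. */
    static String drugRolesQuery(String drugIri) {
        return String.join("\n",
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
                "PREFIX owl: <http://www.w3.org/2002/07/owl#>",
                "SELECT ?role ?target WHERE {",
                "  <" + drugIri + "> rdfs:subClassOf ?restriction .",
                "  ?restriction a owl:Restriction ;",
                "               owl:onProperty ?role ;",
                "               owl:someValuesFrom ?target .",
                "}");
    }
}
```

The `rdfs:subClassOf+` property path requires SPARQL 1.1; it walks the class hierarchy so that products nested several levels below the root are also returned.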
4.5. Define and Construct the Model of the Knowledge Graph
For the construction of the Medicament KGM that represents the data regarding drugs, it is necessary to design and implement the conceptual model of the target module graph, which is the T-Box of the ontology. Therefore, and considering the types of data (or objects) returned by queries, the ontological model presented in Figure 7 was designed. This model represents a different view than the original NDF-RT knowledge graph. The resulting KGM is a simplified and much lighter version of the original; this model has more intuitively named concepts and relationships that facilitate reuse in another knowledge graph. For instance, it is clearer that a given instance of the Medicament concept may treat, or may diagnose, or may prevent any instance of the Disease concept.
Following the NDF-RT documentation, the classes (concepts) are defined as follows:
- (a) Medicament. This concept is at the core of the hierarchy in NDF-RT. It includes classifications of medications, generic ingredient preparations used in medications, and orderable drug products.
- (b) Clinical Kinetics. Represents a collection of concepts describing the absorption, distribution, and elimination of drug active ingredients.
- (c) Chemical Ingredient. Represents chemicals or other drug ingredients, organized into a chemical structure classification hierarchy.
- (d) Mechanism of Action. Represents molecular, subcellular, or cellular effects of drug generic ingredients, organized into a chemical function classification hierarchy.
- (e) Physiological Effect. Concerns tissue, organ, or organ system effects of drug generic ingredients, organized into an organ system classification hierarchy.
- (f) Dose Form. Represents a specific hierarchy of administered medication dose forms.
- (g) Disease. Represents pathophysiologic as well as certain non-disease physiologic states that are treated, prevented, or diagnosed by an ingredient or drug product. May also be used to describe contraindications.
The semantic relations (object properties) are defined as follows:
- (a) CI Chemical class. This relation is used to specify that a medicament is contraindicated with a chemical ingredient.
- (b) Has ingredient. Relation used to indicate that a medicament has a chemical ingredient.
- (c) Has active metabolites. Relation used to indicate that a medicament has active metabolites.
- (d) Has PE. This relation is used to specify that a medicament has a physiological effect.
- (e) CI PE. Relation used to indicate that a medicament is contraindicated with a physiological effect.
- (f) Has dose form. This relation correlates the medicament with its dose form.
- (g) Has MoA. This relation is used to specify that a medicament has a mechanism of action.
- (h) CI MoA. Relation used to indicate that a medicament is contraindicated with a mechanism of action.
- (i) Induces. Relation used to indicate that a medicament induces a disease.
- (j) May diagnose. Relation used to indicate that a medicament may diagnose a given disease.
- (k) May prevent. Relation used to indicate that a medicament may prevent a disease.
- (l) May treat. Relation used to indicate that a medicament may treat a disease.
- (m) CI with. This relation is used to specify that a medicament is contraindicated with a given disease.
- (n) Has PK. Relation that indicates that a medicament has clinical kinetics (absorption, distribution, and elimination of drug active ingredients).
- (o) Site of metabolism. Relation used to indicate that a medicament has a site of metabolism.
4.6. Develop a Program to Automatically Populate the Knowledge Graph
Finally, a general Java™ program was developed for the construction of the Medicament KGM, which integrates the information retrieval programs and the ontology population methods. The general procedure shown in Figure 8 consists of the following steps: read the list of medicaments from the NDF-RT knowledge graph, recover the properties of the medicaments, populate the Medicament KGM with the list of medicaments, and finally, for each medicament instance, populate the list of properties and update the Medicament KGM.
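The population steps can be sketched with in-memory maps standing in for the query results. This is an illustrative approximation of the procedure, not the program of Figure 8: the namespace `http://example.org/medicament#`, the class name, and the input layout (identifier to label, identifier to relation and target pairs) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: populate a module with medicament individuals, then with their relations. */
final class MedicamentPopulationSketch {

    // Hypothetical module namespace.
    static final String NS = "http://example.org/medicament#";
    static final String RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>";

    /**
     * labels maps a medicament id to its label; relations maps a medicament id to
     * a list of {relationLocalName, targetId} pairs. Returns Turtle statements.
     */
    static List<String> populate(Map<String, String> labels,
                                 Map<String, List<String[]>> relations) {
        List<String> statements = new ArrayList<>();
        // First pass: create one individual per medicament.
        for (Map.Entry<String, String> entry : labels.entrySet()) {
            statements.add(iri(entry.getKey()) + " a " + iri("Medicament") + " ; "
                    + RDFS_LABEL + " \"" + escape(entry.getValue()) + "\" .");
        }
        // Second pass: attach the relations of each medicament that has any.
        for (String id : labels.keySet()) {
            List<String[]> pairs = relations.get(id);
            if (pairs == null) {
                continue;
            }
            for (String[] pair : pairs) {
                statements.add(iri(id) + " " + iri(pair[0]) + " " + iri(pair[1]) + " .");
            }
        }
        return statements;
    }

    static String iri(String localName) {
        return "<" + NS + localName + ">";
    }

    /** Escapes only backslashes and double quotes for a string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Creating all individuals before attaching relations means that every relation added in the second pass has a medicament individual from the first pass as its subject.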
5. Disease KGM
One of the most important concepts of an EHR is that of diseases. There are various definitions of disease in dictionaries or in the medical literature; here we present some general definitions:
“Disease, any harmful deviation from the normal structural or functional state of an organism, generally associated with certain signs and symptoms and differing in nature from physical injury.”
“(an) illness of people, animals, plants, etc., caused by infection or a failure of health rather than by an accident.”
“A disease is an illness which affects people, animals, or plants, for example one which is caused by bacteria or infection.”
To obtain a Disease KGM, we implemented the method to generate and reuse medical KGM as follows:
5.1. Selection of the Knowledge Graph
Considering the management of information for an EHR, a disease is associated with the person or patient to whom the file refers; likewise, a disease is associated with a set of signs and symptoms. Therefore, we decided to reuse the DOID knowledge graph for this purpose. The Human Disease Ontology (DOID) was developed through collaboration among biomedical researchers coordinated by the University of Maryland School of Medicine, Institute for Genome Sciences.
Figure 9 shows a partial view of the DOID ontology; as can be seen, this model presents a complex class structure, and the names are not intuitive either. If information about a particular disease is required, it is necessary to know the specific identifier under which it has been classified in order to search for it and read all the tags added as annotations of the concept. Of course, with a well-developed search tool, the use of this type of model is feasible, but it still requires knowledge and experience in dealing with these kinds of vocabularies.
5.2. Identification of the Data and Attributes to Be Imported
To carry out this review, the DOID knowledge graph was saved in Turtle syntax. The code shown in Figure 10 describes the concept DOID_9352, which corresponds to type 2 diabetes mellitus. Based on this code, we can identify important attributes and relations, such as: obo:IAO_0000115, which is an annotation property used for defining and explaining the meaning of a class or property; oboInOwl:hasDbXref, which is a qualifier that allows cross-referencing other databases, enabling the recovery of information regarding the disease by querying the included references; oboInOwl:hasExactSynonym, which is a property used to include synonyms of the disease; and rdfs:label, which is used to describe the disease.
5.3. Define and Implement Data Structures
To extract and represent disease information as a list of objects, the Java™ classes shown in Figure 11 were implemented.
5.4. Implement a Program to Query the Knowledge Graph
To obtain the information from the DOID knowledge graph, we implemented various Java™ programs using the RDF4J API; these programs execute the SPARQL queries shown in Table 3. The first query retrieves the data of the diseases from the DOID ontology, and the rest of the queries are used to complete the definitions with other important data such as the synonyms of each disease, the parent classes, and the set of database cross-references for each disease. It is important to note that the information retrieved and included in the new ontology contains IRI references to enable further queries in case additional information about a particular disease is required.
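Queries of this kind can be sketched as follows. They are not the queries of Table 3; they only use the properties named in Section 5.2 (rdfs:label, obo:IAO_0000115, oboInOwl:hasExactSynonym, oboInOwl:hasDbXref) together with rdfs:subClassOf for the parent classes. The class and method names are hypothetical, and restricting the results to IRIs that start with the DOID prefix is an assumption about what should count as a disease.

```java
/** Sketch: SPARQL query strings for disease data, synonyms, parents, and cross-references. */
final class DiseaseQuerySketch {

    static final String PREFIXES = String.join("\n",
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
            "PREFIX owl: <http://www.w3.org/2002/07/owl#>",
            "PREFIX obo: <http://purl.obolibrary.org/obo/>",
            "PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>");

    /** All DOID classes with their label and, when present, their textual definition. */
    static String allDiseases() {
        return String.join("\n",
                PREFIXES,
                "SELECT ?disease ?label ?definition WHERE {",
                "  ?disease a owl:Class ;",
                "           rdfs:label ?label .",
                "  OPTIONAL { ?disease obo:IAO_0000115 ?definition }",
                "  FILTER(STRSTARTS(STR(?disease), \"http://purl.obolibrary.org/obo/DOID_\"))",
                "}");
    }

    /**
     * Values of one property for one disease, for example
     * "oboInOwl:hasExactSynonym", "oboInOwl:hasDbXref", or "rdfs:subClassOf".
     */
    static String valuesOf(String diseaseIri, String prefixedProperty) {
        return String.join("\n",
                PREFIXES,
                "SELECT ?value WHERE {",
                "  <" + diseaseIri + "> " + prefixedProperty + " ?value .",
                "}");
    }
}
```

For instance, `DiseaseQuerySketch.valuesOf("http://purl.obolibrary.org/obo/DOID_9352", "oboInOwl:hasExactSynonym")` produces a query for the synonyms of the concept discussed in Section 5.2. When the property is rdfs:subClassOf, the results may include blank nodes for class restrictions, so a caller interested only in named parent classes would need to filter them out.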
5.5. Define and Construct the Model of the Knowledge Graph
For the construction of the Disease knowledge graph, it is necessary to design the conceptual model, which is the T-Box of the ontology. Considering the structure of the Disease type returned by SPARQL queries, the ontological model depicted in Figure 12 was developed. This model mainly represents the Disease concepts with the most important references and definitions to other medical databases; it is also a simplified and lighter representation resource compared with the original DOID knowledge graph. The most important difference is the elimination of directly and indirectly imported ontologies. The extraction and construction of an ontological representation model without the use of imported ontologies offers benefits in terms of managing memory resources to perform logical inference. Specifically, the representation model developed has basic attributes that allow diseases to be identified without the burden of including reference models. If required, the concepts can be expanded through queries using the IRIs and the referenced databases.
5.6. Develop a Program to Automatically Populate the Knowledge Graph
Figure 13 shows the general Java program that was implemented for the construction of the Disease KGM. It starts by executing the program that obtains the list of disease definitions from the DOID ontology, and then updates the list of disease objects by obtaining the synonyms, the parent class, and the database cross references of each disease. Finally, the program writes the list of disease objects into the Disease ontology file.
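The steps just described (obtain the list, complete each object, write the file) can be sketched as below. This is not the program of Figure 13: the class DiseaseKgmWriter, the nested Entry type, the example.org namespace, and the :Disease and :hasSynonym terms are all hypothetical, and a plain map stands in for the follow-up SPARQL queries. Only synonyms are handled, for brevity; parent classes and cross references would follow the same pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative population step: disease objects to Turtle individuals. */
final class DiseaseKgmWriter {

    /** Minimal, hypothetical disease holder. */
    static final class Entry {
        final String id;     // local identifier, assumed to be a valid Turtle local name
        final String label;
        final List<String> synonyms = new ArrayList<>();

        Entry(String id, String label) {
            this.id = id;
            this.label = label;
        }
    }

    /** Completes each entry with synonyms; the map stands in for per-disease queries. */
    static void enrich(List<Entry> entries, Map<String, List<String>> synonymsById) {
        for (Entry e : entries) {
            e.synonyms.addAll(synonymsById.getOrDefault(e.id, List.of()));
        }
    }

    /** Serializes every entry as an individual of a Disease class. */
    static String toTurtle(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("@prefix : <http://example.org/disease#> .\n");
        sb.append("@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n");
        for (Entry e : entries) {
            sb.append(":").append(e.id).append(" a :Disease ;\n");
            sb.append("    rdfs:label ").append(literal(e.label));
            for (String s : e.synonyms) {
                sb.append(" ;\n    :hasSynonym ").append(literal(s));
            }
            sb.append(" .\n");
        }
        return sb.toString();
    }

    /** Quotes a string as a Turtle literal, escaping backslashes, quotes, and line breaks. */
    static String literal(String text) {
        return "\"" + text.replace("\\", "\\\\")
                          .replace("\"", "\\\"")
                          .replace("\n", "\\n")
                          .replace("\r", "\\r") + "\"";
    }
}
```

Writing diseases as individuals of a single class, rather than as classes, reflects the design decision discussed in the conclusions; in practice an RDF library would normally be used for serialization instead of string concatenation.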
6. Knowledge Graph Modularization System
We integrated all the programs developed into a modularization system. This system design is based on the object-oriented programming paradigm, which makes it easy to carry out updates or adaptations in any of the tasks of the modularization process.
Figure 14 shows the class diagram of the modularization system. The objective of this system is to extract the concepts of interest from diverse medical knowledge graphs, represent these concepts using an object-oriented paradigm, and generate specific KGMs that can be reused in the target EHR knowledge graph.
Applying the described method, the LaboratoryTest, Symptom, and Vaccine KGMs were also generated. As can be seen, the method for generating and reusing medical KGMs is semi-automatic. This is because the knowledge graphs from which the modules will be extracted must be known in detail, and because important decisions must be made regarding which attributes and relationships are required in the target graph.
Another relevant aspect that must be taken into consideration is that medical knowledge graphs are constantly being updated; therefore, it is essential to review the changes in the original model and, when needed, make the pertinent updates in the system.
As a result of the implementation of the method for the generation and reuse of KGMs, four modules were obtained, which are described below.
7. Evaluation
To evaluate the method reported in this paper, this section describes two different approaches to evaluating the resulting modules.
7.1. Evaluation of the Usefulness of the Knowledge Graph Modules
According to Duque-Ramos et al. [21], “the quality of an ontology module can be defined as the degree of conformance to functional and non-functional requirements”. The knowledge graph modules generated were intended to be used as part of an integral knowledge graph to support the management of EHRs. With this goal in mind, we review the general knowledge graph and its usefulness for representing patient data with relationships to these modules.
Figure 15 shows the EHR general graph diagram, which presents in light blue the predefined concepts and attributes: Patient, ClinicalDiagnosis, StateFederative, and Municipality; this diagram also shows in dark blue the concepts that come from the imported graph modules: LaboratoryTest, NDFRT_Disease, PharmacologicalTreatment, and COVID-19 Vaccine.
It is important to mention that this general knowledge graph for the management of EHRs was populated with 500 clinical records of patients. To show the usefulness of the integrated knowledge graph, Figure 16 shows an instance of real data from a female patient who was diagnosed with coronavirus; this diagnosis is related to the NDFRT_Disease graph. Figure 17 shows the specific diagnosis related to the disease (NDFRT_Disease imported graph), to the laboratory test required for the diagnosis (LaboratoryTest imported graph), and to the pharmacological treatment. Figure 18 shows an example of pharmacological treatment for COVID-19, which is related to the medications (Medicament imported graph) indicated by the doctor. Figure 19 shows the signs and symptoms related to coronavirus disease, taken from the Symptom knowledge graph.
Based on the initial requirements of the EHR management system, it can be determined that the imported graph modules satisfactorily meet the objective of the model. Therefore, from the application and utility point of view, the graph modules generated by the method described in this article are correctly designed and meet the required information needs.
7.2. Evaluation Based on Graph Metrics
To carry out a quantitative evaluation, this subsection compares the metrics of the original knowledge graphs with those of the generated knowledge graph modules. The metrics used for the comparison are those generated by the Protégé ontology editor. Among all the metrics presented, the most important is the number of axioms, which, despite being very general, offers a good measure of the reduction achieved with modularization. The general idea is to estimate the size of the extracted module and the reduction achieved.
Table 4 shows, on the left side, the original NDF-RT metrics, and on the right side, the metrics of the Medicament KGM. As can be seen, the extracted Medicament KGM represents 20.33% of the complete NDF-RT. Another important difference is the number of classes used in the Medicament KGM, which is 7, while the original has 46,047 classes. This is mainly because the original graph does not define individuals; it handles everything with classes.
Table 5 shows, on the left, the metrics of the original DOID knowledge graph, and on the right, the metrics of the Disease KGM obtained from the modularization process. The Disease KGM metrics show a 56% reduction in the number of axioms with respect to the original knowledge graph.
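The two results above are phrased differently: one as the share of the original that the module represents, the other as the reduction achieved. The following sketch makes the relationship between both readings explicit. The class and method names are hypothetical, and the axiom counts in the usage note are made-up round numbers, not values from Tables 4 or 5.

```java
/** Illustrative size comparison between an original graph and an extracted module. */
final class AxiomMetrics {

    /** Percentage of the original axioms that the module retains. */
    static double sharePercent(long originalAxioms, long moduleAxioms) {
        if (originalAxioms <= 0) {
            throw new IllegalArgumentException("original axiom count must be positive");
        }
        return 100.0 * moduleAxioms / originalAxioms;
    }

    /** Percentage of the original axioms that modularization removed. */
    static double reductionPercent(long originalAxioms, long moduleAxioms) {
        return 100.0 - sharePercent(originalAxioms, moduleAxioms);
    }
}
```

For a hypothetical original graph with 1000 axioms and a module with 250, sharePercent gives 25.0 and reductionPercent gives 75.0. Under the same reading, and assuming the 20.33% figure refers to axioms, a module that represents 20.33% of its source corresponds to a reduction of 79.67%.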
8. Conclusions
One of the difficult problems to solve when reusing medical knowledge graphs is that they directly or indirectly import other ontologies or graphs. This loads a large number of models into RAM, which prevents reasoning and inference from being carried out efficiently.
In this article, we have described a modularization method applicable to large knowledge graphs, in particular to graphs from the biomedical area that, in addition to being large in volume, are highly complex for the user. The described modularization method differs from methods that seek to divide a graph into parts, each of which is self-contained. It is also different from fully automatic methods that extract a module by considering logically complete modules.
The modularization method described is semi-automatic and is more oriented towards having support tools to solve very specific needs of information or specialized knowledge that must be incorporated into larger and more complex knowledge graphs.
The application examples make it possible to observe and analyze the specific decisions that must be made when designing the knowledge modules and selecting the attributes needed to meet the requirements. They also show that, once the design is complete, a mechanism for automatic population can be programmed.
It should be noted that the reuse of knowledge graphs entails many challenges, one of the most complicated being that the structure of the original graph may not be the most appropriate for the destination graph. Therefore, in the examples shown in this article, we have made the decision to make changes in the way of representing the classes and class hierarchies. We have chosen to handle many concepts or classes as individuals in the destination graph.