1. Introduction
The research objective is to construct a disease-symptom knowledge graph (DSKG) automatically from disease-symptom relations determined in documents downloaded from two medical web-board resources: a Thai-hospital web-board resource (e.g.,
http://www.si.mahidol.ac.th/sidoctor/e-pl/ (accessed on 15 February 2021);
https://www.bumrungrad.com (accessed on 15 February 2021); etc.) and a Thai-Healthcare-Knowledge web-board resource (e.g.,
http://haamor.com/ (accessed on 15 February 2021); and
https://www.doctor.or.th/doctorme/general (accessed on 15 February 2021)). The DSKG is a cause-effect knowledge graph that represents disease-symptom relations, a cause-effect relation type between causative concept nodes and effect concept nodes, where each causative concept node is represented by a disease-name concept and each effect concept node is represented by a correlated symptom-concept group. According to [1], most patients with a given disease have multiple symptoms rather than a single symptom, and some of these symptoms are correlated or co-occur as common symptoms among several diseases; e.g., stuffy nose, runny nose, and cough are common symptoms of Cold, Flu, Airborne Allergy, and COVID-19 (https://newsinhealth.nih.gov/2022/01/it-flu-covid-19-allergies-or-cold (accessed on 15 February 2021)). The DSKG presents the disease-symptom relations determined or extracted from the downloaded documents, where each disease-symptom relation is a link connecting a disease-name concept node to a node containing a symptom-concept group as an occurrence of multiple symptoms. This disease-symptom relation is a semantic relation, specifically the cause-effect relation type (called CErel), which links each disease-name concept ($d_i$; $i = 1, 2, \ldots, numofDiseaseElements$) as the causative concept to the corresponding symptom-concept group ($\text{SymGroup}_{d_i}$) as the effect-concept group resulting from $d_i$ in the $d_i$ document (the downloaded disease document having $d_i$ as its topic name), where $d_i \in$ DS, the disease-name concept set, and $\text{SymGroup}_{d_i} \subset$ SG, the set of symptom-concept groups resulting from the corresponding DS elements. Moreover, the symptom concepts in our research include sign concepts and are the basic symptoms of each $d_i$. In general, a semantic relation is a directional link between two or more concepts, entities, or sets of entities that participate in the relation [2], as follows:
$$<\text{Concept1}> \xrightarrow{(\text{relation type})} <\text{Concept2}>$$
where the “<…>” and “(…)” symbols represent a concept and a relation type, respectively.
Thus, CErel is the relation type that links Concept1, e.g., $d_i$, to Concept2, e.g., $\text{SymGroup}_{d_i}$, as follows:
$$<d_i> \xrightarrow{(\text{CErel})} <\text{SymGroup}_{d_i}>$$
where $\text{SymGroup}_{d_i} = \{Sym_{i1}, Sym_{i2}, \ldots, Sym_{i\,last\_i}\}_{of\_d_i}$; SYM is the universal symptom-concept set, so $Sym_{ij} \in$ SYM; $i = 1, 2, \ldots, numofDiseaseElements$; $j = 1, 2, \ldots, last\_i$; $\text{SymGroup}_{d_1} \cup \text{SymGroup}_{d_2} \cup \ldots \cup \text{SymGroup}_{d_{numofDiseaseElements}} \subset$ SYM; and $\text{SymGroup}_{d_i} \cap \text{SymGroup}_{d_l}$ may or may not be empty, where $i \neq l$ and $1 \le l \le numofDiseaseElements$. Moreover, concepts and relations are the foundation of knowledge and thought [2]: concepts are the building blocks of knowledge, and relations are the cement that links concepts into knowledge structures. Following this knowledge structure, the DSKG is formed by CErel connections that connect several different <$d_i$> nodes to a node containing several correlated $s_c$ features (where $s_c$ is a symptom concept expressed in the documents; $s_c \in$ S is obtained by the union of all subsets of SG, i.e., all $\text{SymGroup}_{d_i}$ from all $d_i$ documents; $i = 1, 2, \ldots, num$, where $num$ is $numofDiseaseElements$; S $\subset$ SYM; S = $\{s_c\}$; and $c$ is an index, $c = 1, 2, \ldots, m$, where $m$ is the number of symptom-concept features; see Figure 1).
With regard to Figure 1, the DSKG also presents a <$s_1, s_8, \ldots, s_\beta$> node as the common symptom-concept features among the $d_i$ nodes.
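The node-and-link structure described above can be illustrated with a minimal Python sketch. It is not the construction method of this paper (which uses PCA, see Section 3.3); it only assumes a simpler rule in which symptom concepts shared by exactly the same diseases are merged into one multi-symptom node. The function name `build_dskg` and the per-disease symptom assignments are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy data: disease-name concept -> SymGroup (set of symptom concepts).
# Disease names follow the text; the individual symptom assignments are illustrative only.
sym_groups = {
    "Cold": {"stuffy_nose", "runny_nose", "cough", "sneezing"},
    "Flu": {"stuffy_nose", "runny_nose", "cough", "fever"},
    "AirborneAllergy": {"stuffy_nose", "runny_nose", "cough", "itchy_eyes"},
}

def build_dskg(sym_groups):
    """Return CErel edges as (disease, frozenset_of_symptoms) pairs.

    Assumption of this sketch: symptoms that occur with exactly the same set
    of diseases are merged into one multi-symptom node, so a common group is
    stored once and linked to every disease that has it.
    """
    diseases_by_symptom = defaultdict(set)
    for disease, symptoms in sym_groups.items():
        for symptom in symptoms:
            diseases_by_symptom[symptom].add(disease)

    # Group symptoms by the exact set of diseases they occur with.
    node_by_diseases = defaultdict(set)
    for symptom, diseases in diseases_by_symptom.items():
        node_by_diseases[frozenset(diseases)].add(symptom)

    edges = []
    for diseases, symptoms in node_by_diseases.items():
        node = frozenset(symptoms)
        for disease in sorted(diseases):
            edges.append((disease, node))  # one CErel link per disease
    return sorted(edges, key=lambda e: (e[0], sorted(e[1])))

edges = build_dskg(sym_groups)
common_node = frozenset({"stuffy_nose", "runny_nose", "cough"})
for disease, node in edges:
    print(disease, "-(CErel)->", sorted(node))
```

In this toy graph the shared group of stuffy nose, runny nose, and cough appears as a single node linked to all three diseases, which mirrors the common symptom-concept node of Figure 1.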
In addition, identifying symptoms in terms of symptom concepts is vital for diagnosing diseases in the medical field. Approximately 70–90% of the diagnostic information comes from a patient’s history and physical examinations that involve costly physical tests [1]. Although about one third of the identifiable common symptom concepts do not provide a conclusive disease-based explanation [1], the DSKG of our research can be used as an additional healthcare procedure for the preliminary diagnosis of potential diseases during the diagnostic process, which potentially reduces physical examination costs. Moreover, the DSKG can help healthcare practitioners avoid tunnel vision and remain aware of the presence of multiple symptom concepts. The use of the DSKG can also be extended beyond healthcare practitioners to non-professionals, for the preliminary diagnosis of possible diseases from actual symptom occurrences, through a web application containing a DSKG-based healthcare recommender that runs on their mobile phones or computers.
Thus, the research focuses on constructing the DSKG from the determined cause-effect pairs, i.e., the $d_i$-$\text{SymGroup}_{d_i}$ pairs having CErel, from the downloaded disease documents on the Thai medical web-board resources. The $Sym_{ij}$ element expressed in the $d_i$ document is mostly based on an event expression in an Elementary Discourse Unit (EDU, defined as a simple sentence or a clause by [3]). The event expression is conveyed by a verb with event semantics [4] in the EDU’s verb phrase, where each EDU expression is based on a general linguistic expression, e.g., a general Thai linguistic expression (see Figure 2), after word stemming and stop-word elimination.
In Figure 2, the concept of each element (called “an element concept”) in the $\text{Verb}_{weak}$, $\text{Verb}_{strong}$, Adv, Adj, and Noun sets is based on the medical-symptom-expression lists of Wikipedia (https://en.wikipedia.org/wiki/List_of_medical_symptoms (accessed on 10 January 2022)) and MeSH (https://www.ncbi.nlm.nih.gov/mesh (accessed on 15 February 2021)), after translation from English to Thai with the Lexitron Dictionary (https://dict.longdo.com (accessed on 15 February 2021)), followed by Thai-to-English translation with the Lexitron Dictionary and WordNet [5] (https://word-net.princeton.edu/obtain (accessed on 15 February 2021)). An example of a downloaded disease document is shown in Figure 3, which contains seven different symptom-concept expressions based on the verb phrases of EDU2-EDU7 and EDU11.
Several techniques in the literature [6,7,8,9,10,11,12,13,14,15] have been applied to determine the cause-effect/disease-symptom relation from unstructured data, e.g., texts, without constructing a cause-effect/disease-symptom graph or network, except [7,12,13,14,15] (see Section 2). Each cause-effect/disease-symptom relation in the graph or network of [7,12,13,14,15] is based on one causative-concept feature, e.g., a disease-name concept feature, connected to one effect-concept feature, e.g., a symptom-concept feature. In contrast, our DSKG is constructed from several CErel connections, where each CErel connection is the link between one causative-concept feature, e.g., a $d_i$ feature, and a group of correlated effect-concept features (e.g., a group of the $s_{n_1}, s_{n_2}, \ldots, s_{n_\eta}$ features, which are the correlated $s_c$ features acting as the common symptom-concept features), where $n_1, n_2, \ldots, n_\eta$ are values of the index $c$; $0 < n_1 < n_2 < \ldots < n_\eta$; $\eta$ is $numberOfCorrelatedSymtomConcepts$; and $\eta \le m$, where $m$ is the number of symptom-concept features (see Figure 1).
In [12,13,14,15], determining the CErel or disease-symptom relation from documents involves a disease-name concept feature set and a symptom-concept feature set whose symptom-concept feature elements are mostly expressed by at least one term/word in NP1 or NP2 of a simple sentence (see Section 2). Another study [16] addresses only symptom recognition from documents, without determining the disease-symptom relation, where the symptoms are based on either NP1 or VP. In contrast, the symptom-concept features of our DSKG construction are mostly expressed by at least two terms/words as a composite variable [17] in an EDU’s verb phrase (VP), including NP1 of the EDU, after word stemming and stop-word removal, e.g., “(คอ/throat)/NP1 ((เจ็บ/hurt)/Verbstrong)/VP” (a sore throat), “(ศรีษะ/head)/NP1 ((มี/has)/Verbweak (อาการ/symptom)/Noun (บวม/swelling)/NP2)/VP” (a swollen head), and “(คนไข้/patient)/NP1 ((ปวด/pain)/Verbstrong (ศรีษะ/head)/NP2)/VP” (the patient has a headache). A composite variable combines two or more individual variables, called indicators, into a single variable [17]. Each indicator alone does not provide sufficient information, but together the indicators can represent a more complex concept. The composite variable of the symptom-concept feature in our research consists of one or two terms from the EDU’s VP and one term from the EDU’s NP1, which together yield a symptom/effect concept.
However, Thai documents have some specific characteristics, such as zero anaphora (implicit noun phrases) and the absence of word and sentence delimiters. These characteristics contribute to three main problems in constructing the DSKG from the documents: (1) how to determine the symptom concept of an $\text{EDU}_h$ occurrence as $\text{EDU}_{h,Sym_{ij}}$ (an $\text{EDU}_h$ occurrence with a symptom concept, $Sym_{ij}$, in a $d_i$ document; $h = 1, 2, \ldots, endDocument\_d_i$) among several $\text{EDU}_h$ occurrences with non-symptom concepts in the $d_i$ document; (2) how to determine CErel between $d_i$ and each $\text{SymGroup}_{d_i}$ element ($Sym_{ij}$ of $d_i$) from a downloaded $d_i$ document, in order to subsequently determine a $d_i$-$\text{SymGroup}_{d_i}$ pair with CErel, where $\text{SymGroup}_{d_i}$ appears as a symptom-concept EDU vector, i.e., $\langle \text{EDU}_{h_1,Sym_{i1}}, \text{EDU}_{h_2,Sym_{i2}}, \ldots, \text{EDU}_{h_{rim},Sym_{i\,last\_i}} \rangle_{of\_d_i}$, in the $d_i$ document ($0 < h_1 < h_2 < \ldots < h_{rim} \le endDocument\_d_i$), while some downloaded $d_i$ documents contain the $d_i$ symptom concepts of the primary disease with or without other symptom concepts resulting from complications as a secondary disease; and (3) how to construct the DSKG based on each <$d_i$> node (represented by the $d_i$ feature) connected to the <$s_{n_1}, s_{n_2}, \ldots, s_{n_\eta}$> node (represented by a group of correlated $s_c$ features), given the high-dimensional feature space of $s_c$, where $s_c \in$ S is obtained by the union of all $\text{SymGroup}_{d_i}$ from the determined $d_i$-$\text{SymGroup}_{d_i}$ pairs having CErel. S thus has a high-dimensional feature space [18], which makes finding the correlated $s_c$ features for the DSKG construction time consuming.
We therefore need a framework that combines statistical techniques, machine learning techniques, and linguistic phenomena to learn several EDU expressions in order to solve the research problems. With regard to Figure 3, most of the symptom-concept occurrences in the documents are based on Verb or VP (see Figure 2). Therefore, we apply a word co-occurrence (called “wc”) pattern to an EDU occurrence, where a wc pattern consists of three major term sets on an EDU occurrence: a predicate-verb term set, an agent term set, and a patient/information term set (see Section 3.1). Word co-occurrence [19,20] is expressed as a compound term with or without any pattern or restriction, depending on each research perspective, whereas the wc pattern in our research is used to obtain an EDU’s wc expression as a composite variable and to determine a wc concept, particularly a symptom concept, called “wcSym”, of a wc expression on an EDU occurrence. Thus, the symptom-concept feature, $Sym_{ij}$, occurring in the $\text{EDU}_{h,Sym_{ij}}$ occurrence of the $d_i$ document, is represented by $wcSym_{ij}$ of the $\text{EDU}_{h,Sym_{ij}}$’s wc expression without concept annotation. With regard to the $d_i$ document, $wcSym_{ij}$ and $d_i$ are then used as a $wcSym_{ij}$ feature and a $d_i$ feature, respectively, to determine a $d_i$-$\text{SymGroup}_{d_i}$ pair with CErel, where the concepts of $wcSym_{ij}$ and $d_i$ are based on the $\text{Verb}_{weak}$, $\text{Verb}_{strong}$, Adv, Adj, and Noun sets (in Figure 2) prepared and collected from the medical-symptom-expression lists of Wikipedia and MeSH after the English-Thai translations. This paper makes three contributions, all based on a statistical approach involving machine learning. The first is how to determine a $\text{SymGroup}_{d_i}$ element, $Sym_{ij}$, based on the wc pattern used to determine the $wcSym_{ij}$ feature as the composite variable without concept annotation, whereas the symptom concepts in previous research, e.g., [13,15], are based only on NP1/NP2, and NP1 is often an ellipsis (i.e., NP1 has a null value) in our documents. The second is how to determine $d_i$-$wcSym_{ij}$ pairs having CErel by machine learning techniques with automatic supervised learning, where the positive/negative instances are formed by the Cartesian product DS × SG [21] with consideration of the disease type ($t$ or $\text{Type}_t$); see Section 3.2. According to the DSKG, each $wcSym_{t,ij}$ feature is based on a basic symptom of $d_{t,i}$, where $d_{t,i}$ is $d_i$ in $\text{Type}_t$, and $wcSym_{t,ij}$ is $wcSym_{ij}$ of $d_{t,i}$. Thus, the symptoms of complications are excluded by using the complicationTerm set, {‘ภาวะแทรกซ้อน/complications’, ‘ไม่รักษา/unTreat’, ...}. The third is how to construct the DSKG by clustering or wrapping the determined S elements into groups of correlated $s_c$ features, as a dimensionality reduction of the feature space of S with minimized information loss. Thus, the constructed DSKG diagram also presents the <$s_{n_1}, s_{n_2}, \ldots, s_{n_\eta}$> node as the multi-symptom concept node among some <$d_i$> nodes.
Therefore, we apply the wc pattern to obtain the wc expression of $\text{EDU}_{h,Sym_{ij}}$, where the wc pattern relies on a predicate-argument pattern [22] (see Section 3.1). The $wcSym_{ij}$ feature is determined from the wc expression by the elements in the $\text{Verb}_{weak}$, $\text{Verb}_{strong}$, Adv, Adj, and Noun sets collected from the medical-symptom-expression lists of Wikipedia and MeSH through the English-Thai translation, without concept annotation (see Section 3.1 and ii in Section 4.2). Moreover, all symptoms of complications are excluded if they occur right after a complicationTerm set element. We also apply the machine learning techniques Support Vector Machine (SVM) [23], Naïve Bayes (NB) [24], and Linear Logistic Regression (LR) [25] to determine $d_{t,i}$-$wcSym_{t,ij}$ pairs having CErel by automatic supervised learning from the result of the Cartesian product DS × SG aligned with the disease type ($t$ or $\text{Type}_t$) on the learning corpus (see Section 3.2). Each $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pair with CErel of $\text{Type}_t$ is subsequently obtained by grouping the determined $d_{t,i}$-$wcSym_{t,ij}$ pairs having CErel with the same $d_{t,i}$ from the test corpus. We then propose using principal component analysis (PCA) [26] to address the high-dimensional feature space of S by wrapping the S elements into groups of correlated $s_c$ features as the common features for constructing the DSKG (see Section 3.3).
This paper is organized into six sections. Section 2 summarizes related work. Section 3 describes the research problems in constructing the DSKG from the documents, and Section 4 presents our framework for constructing the DSKG through CErel determination from the documents. Section 5 evaluates and discusses the proposed methodology, and Section 6 presents the conclusion.
2. Related Works
Several strategies [6,7,8,9,10,11,12,13,14,15] have been proposed to determine the cause-effect/disease-symptom relation from documents as unstructured data, without addressing the construction of a cause-effect/disease-symptom knowledge graph, except [7,12,13,14,15]. Girju [6] determined a causal relation based on a lexico-syntactic pattern (NP1 causal-verb NP2) by decision tree learning. Therefore, the cause/disease and effect/symptom occurrences are based on noun phrases as NP1 and NP2, respectively, within one sentence. In contrast, the causal relation based on one complex sentence [7] was determined/extracted by using a cue-phrase set (a word, a phrase, or a word pattern) connecting an NP pair as a cause and an effect, including probabilities. The extracted causal relations [7] were used to construct a causal network as the knowledge graph for the term protein with two relations, the causal relation and the hypernym relation, without addressing the high-dimensional effect feature sets. Moreover, Riaz and Girju [8] used a set of linguistic features and Integer Linear Programming to learn a causal relation within one sentence from annotated $\text{verb}_{effect}$-$\text{noun}_{cause}$ pairs on verb phrases, with the causal relation based on expert annotations, FrameNet, and WordNet to generate a training corpus. For example: “A campaign has started to try to cut the rising number of children dying [cause from solvent abuse].” where the pair “dying_solvent abuse” encodes causality by the annotation. Reference [9] applied the Restricted Hidden Naïve Bayes model to the lexico-syntactic pattern (NP1 ConnectiveVerb NP2, where NP1 and NP2 are noun phrases as a cause and an effect, respectively, or vice versa) of each sentence to learn the classes, annotated by experts on 26 feature templates categorized into four feature groups: contextual, syntactic, positional, and connective features. They extracted/determined causality from English documents with a precision of 0.873 and a recall of 0.841. Reference [10] also extracted/determined the causal relation within one sentence (where causes and effects were based on noun expressions) by using linguistic rules together with Bayesian inference to reduce the number of pairs produced by ambiguous patterns, whilst [11] used manually annotated syntactic patterns within one sentence or between two sentences having a cause-effect link/relation. The cause-effect links were extracted or determined from scientific papers by a syntactic pattern-based algorithm with an average precision of 47% and an average recall of 70%. The determined cause-effect links were applied to represent the core of scientific papers as a summarization.
Reference [12] extracted disease-symptom relationships from texts by using syntactic patterns chosen by quality- and specificity-based selection from several determined syntactic patterns (where each syntactic pattern was determined on the dependency graph of a sentence containing both a disease entity and a symptom entity based on a noun term or a noun phrase). Reference [13] automatically constructed health knowledge graphs of disease-symptom connections by using logistic regression, naive Bayes, and a Bayesian network with noisy OR gates to learn and determine the connection or relation between the disease codes and the symptom concepts from the textual data of patient medical records, with a precision of 0.85, a recall of 0.6, and an F1-score of 0.704, where the positive/negative classes were based on the actual symptom occurrence in the textual data. However, the symptom expressions in the textual data were based on noun phrases, e.g., “(The patient)/NP1 (complains of (a worsening cough)/NP2)/VP” and “(He)/NP1 (also has (a dry cough)/NP2)/VP.”. Reference [14] determined each disease-symptom relation and also symptom names within one sentence by training a multi-column convolutional neural network (MCNN) on human annotations of 50,000 random sentences from Japanese web texts. The MCNN took an input sentence divided into five consecutive word sequences: a symptom candidate name (SYMname), a given disease name (DISname), the word sequence before the SYMname, the word sequence between the SYMname and the DISname, and the word sequence after the DISname. Their proposed symptom name extraction method achieved a 93.8% F1-score, and the disease-symptom relationship extraction method achieved an 88.3% F1-score, where the extracted symptom names were based on noun or noun-phrase expressions. Reference [15] used the PubMed bibliographic literature database and the association between symptoms and diseases in the MeSH metadata fields of PubMed to determine the disease-symptom relationships, where symptoms and diseases were based on noun phrases. They applied the term frequency-inverse document frequency to measure the strength of an association between $symptom_i$ and $disease_j$ for constructing the disease-symptom network, without addressing the high-dimensional symptom features.
Moreover, Ref. [16] recognized only the medical symptom expressions in patient texts without determining the disease-symptom relation. They applied sentence/phrase templates based on either a noun phrase or a verb phrase, including the symptom concepts labeled by experts, to capture the surface of symptom expressions from the patient text. Machine learning techniques were applied for the multi-label classification of symptoms, including long-tail symptoms, from the surface of the symptom expressions. The model of [16] achieved a 76% F1-score.
However, the causative-concept and effect-concept features of the previous works [6,7,8,9,10,11,12,13,14,15] are based on at least one word of either a verb term expression or a noun-term/noun-phrase expression, without considering composite variables. The cause-effect relation/association determinations of [6,7,8,9,10,11,12,13,14,15], except [13], are based on machine learning with expert or human annotation, whereas the CErel of our research is based on automatic supervised learning. In the few research works [7,12,13,14,15] on cause-effect graph/network construction from unstructured data, i.e., texts, the cause-effect graphs/networks are based on each CErel connection between the node of the causative-concept feature and the node of the effect-concept feature, without addressing the high-dimensional data in the feature set, particularly the effect-concept feature set, even though their corpora are large. In contrast, our DSKG construction (which is based on the CErel connection between the <$d_i$> node and the <$s_{n_1}, s_{n_2}, \ldots, s_{n_\eta}$> node containing the group of correlated $s_c$ features as an occurrence of multiple symptoms) involves the high-dimensional feature problem.
3. Problems of DSKG Construction
There are three main problems that must be solved: how to determine the $wcSym_{ij}$ features on the $\text{EDU}_{h,Sym_{ij}}$ occurrences without concept annotation on the $d_i$ documents, how to determine the $d_i$-$wcSym_{ij}$ pairs having CErel with automatic supervised learning, and how to construct the DSKG based on the correlated $s_c$ features with the high-dimensional feature problem.
3.1. How to Determine $wcSym_{ij}$ Features on $\text{EDU}_{h,Sym_{ij}}$ Occurrences without Concept Annotation on $d_i$ Documents
According to the corpus behavior study of the health-care domain, most of the symptom concepts in the $\text{EDU}_{h,Sym_{ij}}$ occurrences are event or state expressions conveyed by verb phrases, where each verb phrase contains a predicate verb/predicate-verb term ($v_a$; $a = 1, 2, \ldots, numberOfpredicate$) that is used to identify the $\text{EDU}_{h,Sym_{ij}}$ expressions among the $\text{EDU}_h$ expressions, as shown in the following examples (a)–(d) (other than Figure 3). A further problem is that the same concept can appear with different verb phrase expressions, resulting in different wc expressions. These examples also include the phonetic transcription from http://translate.google.com/ (accessed on 15 February 2021).
Example 1:
- (a)
EDU: “[คนไข้] ปวดกล้ามเนื้อ” ([A patient] has a muscle pain.)
“([(คนไข้/Khnk̄hị̂)/patient])/NP1
(((ปวด/pwd)/pain)/Verbstrong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun)/VP”
- (b)
EDU: “ผู้ป่วยมีอาการปวดกล้ามเนื้อเล็กน้อย” (A patient has a symptom of mild muscle pain.)
“((ผู้ป่วย/P̄hū̂ p̀wy)/patient)/NP1 (((มี/mī)/has)/Verbweak ((อาการ/xākār)/symptom)/Noun ((ปวด/pwd)/pain)/Verbstrong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun ((เล็กน้อย/lĕkn̂xy)/mild)/Adj)/VP”
- (c)
EDU: “ผู้สูงอายุมีอาการเหนื่อย” (An elder has a tired symptom.)
“((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/NP1 (((มี/mī)/has)/Verbweak ((อาการ/xākār)/symptom)/Noun ((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verbstrong)/VP”
- (d)
EDU: “[ผู้ป่วย] เหนื่อยมาก” ([A patient] is very tired.)
“([(ผู้ป่วย/P̄hū̂ p̀wy)/patient])/NP1 (((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verbstrong ((มาก/māk)/very)/Adv)/VP”
where: (a) and (b) examples and (c) and (d) examples have different verb phrase expressions with the same major symptom concepts of “(ปวดกล้ามเนื้อ/pwd kl̂ām neụ̄̂x)/pains in muscle” and “(เหนื่อย/h̄enụ̄̀xy)/be tired”, respectively; and the [..] symbol means ellipsis.
With regard to [22], the predicate-argument pattern is as follows: verb(agent_argument, patient/information_argument), where verb is an element of a predicate-verb term set; agent_argument is an element of an agent term set; and patient/information_argument is an element of a patient/information term set. Based on the predicate-argument pattern, we apply the following wc pattern to each EDU occurrence based on Figure 2 to obtain the wc expressions for automatically determining the $wcSym_{ij}$ features after the $\text{EDU}_{h,Sym_{ij}}$ identification from the $d_i$ documents by the predicate-verb term set.
$$wc\ \text{pattern}: \quad V + W_1 + W_2$$
where:
$V$ is a predicate-verb term set; $V = \text{Verb}_{strong} \cup V_{inf}$; $v_a \in V$. Since $v_{weak,b}$ has a weak symptom concept, $w_{inf,c}$, an information word, is added to $v_{weak,b}$ to form a strong symptom concept, which is an element of $V_{inf}$, i.e., $(v_{weak,b} + w_{inf,c}) \in V_{inf}$, where $v_{weak,b} \in \text{Verb}_{weak}$; $w_{inf,c} \in (\text{Noun} \cup \text{Adj} \cup \text{Verb}_{strong})$; $w_{inf,c}$ is the word right after $v_{weak,b}$; and $a$, $b$, $c$ are integer indices.
$W_1$ is an agent term set; $w_{1,g} \in W_1$; $w_{1,g}$ is a head noun or a Noun element of NP1; and $g$ is an integer index.
$W_2$ is a linguistic patient/information term set; $w_{2,e} \in W_2$; $W_2 = \text{Noun} \cup \text{Adj} \cup \text{Adv}$; $w_{2,e}$ is also a word sequence right after $v_a$; $w_{2,e}$ has a null value if it does not exist; and $e$ is an integer index.
Moreover, the concept elements of the $\text{Verb}_{strong}$, $\text{Verb}_{weak}$, $V_{inf}$, Noun, Adj, and Adv sets of Figure 2 are first prepared and collected from the results of the Thai-word and Thai-EDU segmentations on the translated terms (English to Thai with the Lexitron Dictionary) of the medical-symptom-expression lists on the Wikipedia and MeSH web sites. The segmented Thai words are then translated from Thai to English with the Lexitron Dictionary and WordNet to collect a concept for each element in the $\text{Verb}_{strong}$, $\text{Verb}_{weak}$, $V_{inf}$, Noun, Adj, and Adv sets in Figure 2. If a segmented Thai word has several English word senses, an expert selects the corresponding symptom concept in English (see ii in Section 4.2).
For each corpus in our research, the $wcSym_{ij}$ feature (which takes the form of the predicate-argument pattern) is then determined from each term of the wc expression, including the Thai-to-English translation, by the collected element concepts of the $\text{Verb}_{strong}$, $\text{Verb}_{weak}$, $V_{inf}$, Noun, Adj, and Adv sets. When obtaining the wc expression for the $wcSym_{ij}$ feature determination, the $w_{1,g}$ ellipsis has to be resolved by the following rule: $w_{1,g}$ is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis. To handle the problem of different verb phrase expressions with the same symptom concept, we apply another rule to obtain the wc expression with the actual symptom expression: if “$v_{weak,b} + w_{inf,c}$” is “มี/have + อาการ/symptom”, we take the next two words right after “$v_{weak,b} + w_{inf,c}$” to be $v_a$ and $w_{2,e}$ as the actual symptom expression instead. This is shown below for the previous (b) and (c) in Example 1, including the Thai-to-English translation of each term in the wc expression by using the collected element concepts in the $\text{Verb}_{strong}$, $\text{Verb}_{weak}$, $V_{inf}$, Noun, Adj, and Adv sets.
- (b)
V + W1 + W2 = ((ปวด/pwd)/pain)/Verbstrong +([(ผู้ป่วย/P̄hū̂p̀wy)/patient])/Noun + ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun
- (c)
V + W1 + W2 = ((เหนื่อย/h̄enụ̄̀xy)/beTired)/Verbstrong + ((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/Noun + null
Moreover, some general-concept rules are applied to the acquired wc expressions for the general $wcSym_{ij}$ feature presentation: the $w_{1,g}$ concept is “person” if $w_{1,g}$ is in the Person set, {‘คนไข้,ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, …}. In addition, if the term in $w_{2,e}$ is a concept of an element in the Symptom-Expression-Level set, {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}, $w_{2,e}$ takes a null value for a general symptom concept. Thus, the $wcSym_{ij}$ features of the EDUs’ wc expressions in (a)–(d) of Example 1 are represented by the predicate-argument pattern as follows: (a) pain([person],muscle), (b) pain(person,muscle), (c) beTired(person), (d) beTired([person]), where examples (a) and (b), and examples (c) and (d), have different EDU verb-phrase expressions but the same symptom concepts, pain(person,muscle) and beTired(person), respectively.
Therefore, after each $\text{EDU}_{h,Sym_{ij}}$ occurrence in the $d_i$ document has been identified by the predicate-verb term ($v_a$) followed by the $w_{1,g}$ and $w_{2,e}$ of the wc pattern, the $wcSym_{ij}$ feature is automatically determined by translating all wc expression terms ($v_a$, $w_{1,g}$, $w_{2,e}$) from Thai to English using the collected concept elements of the $\text{Verb}_{strong}$, $\text{Verb}_{weak}$, $V_{inf}$, Noun, Adj, and Adv sets.
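The rules of this subsection can be sketched in Python as follows. This is only an illustrative reading, not the implementation used in the paper: it assumes the EDU has already been segmented, tagged, and translated to English concepts, it handles only $\text{Verb}_{strong}$ predicates plus the “have + symptom” rule (other $V_{inf}$ combinations are left out), and the function name `wc_sym`, the input layout, and the treatment of “mild” as a level word are hypothetical.

```python
# Person and Symptom-Expression-Level sets, using the English glosses from the text.
# "mild" is added here as an assumption because example (b) glosses the level word that way.
PERSON = {"patient", "elder", "infant", "child", "teenager"}
LEVEL = {"very", "little", "moderately", "mild"}

def wc_sym(np1, vp, prev_np1=None):
    """Build a wcSym feature string in predicate-argument form.

    np1      -- English concept of the NP1 head noun, or None for an ellipsis
    vp       -- ordered list of (concept, tag) pairs for the verb phrase
    prev_np1 -- NP1 concept of the previous EDU, used to resolve an ellipsis
    """
    # Rule: an elided NP1 takes the Noun element of the previous EDU's NP1.
    w1 = np1 if np1 is not None else prev_np1

    # Rule: skip the "have + symptom" carrier so the actual symptom words follow.
    if len(vp) >= 2 and vp[0] == ("have", "Verbweak") and vp[1] == ("symptom", "Noun"):
        vp = vp[2:]

    # This sketch only accepts a strong predicate verb as v_a.
    if not vp or vp[0][1] != "Verbstrong":
        return None
    v_a = vp[0][0]
    w2 = vp[1][0] if len(vp) > 1 else None

    # General-concept rules: level words become null, persons become "person".
    if w2 in LEVEL:
        w2 = None
    if w1 in PERSON:
        w1 = "person"

    args = [a for a in (w1, w2) if a is not None]
    return f"{v_a}({','.join(args)})"

# Examples (a)-(d) of Example 1, with the elided NP1 resolved from a previous EDU.
a = wc_sym(None, [("pain", "Verbstrong"), ("muscle", "Noun")], prev_np1="patient")
b = wc_sym("patient", [("have", "Verbweak"), ("symptom", "Noun"),
                       ("pain", "Verbstrong"), ("muscle", "Noun"), ("mild", "Adj")])
c = wc_sym("elder", [("have", "Verbweak"), ("symptom", "Noun"), ("beTired", "Verbstrong")])
d = wc_sym(None, [("beTired", "Verbstrong"), ("very", "Adv")], prev_np1="patient")
print(a, b, c, d)
```

Under these assumptions, (a) and (b) map to the same feature, as do (c) and (d), which is the normalization the rules are meant to achieve.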
3.2. How to Determine $d_i$-$wcSym_{ij}$ Pairs Having CErel with Automatic Supervised Learning
We apply SVM, NB, and LR to learn the $d_i$-$wcSym_{ij}$ pairs having CErel with automatic supervised learning from the learning corpus, where the positive and negative instances, with the CErel and nonCErel classes, respectively, are assigned by the Cartesian product DS × SG aligned with the disease types. Thus, the downloaded disease documents are separated into two groups according to two disease types, an infectious disease type ($\text{Type}_t = \text{Type}_1$) and a non-infectious disease type ($\text{Type}_t = \text{Type}_2$). Each disease type contains the $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pairs having the CErel connections that link the $d_{t,i}$ features to the corresponding $\text{SymGroup}_{d_{t,i}}$ features determined from the $d_{t,i}$ documents (where $d_{t,i}$ is $d_i$ in $\text{Type}_t$; $\text{SymGroup}_{d_{t,i}}$ is $\text{SymGroup}_{d_i}$ resulting from $d_{t,i}$ in the $d_{t,i}$ document; and $wcSym_{t,ij}$ is $wcSym_{ij}$ of $d_{t,i}$). Thus, $Sym_{t,ij}$ (a $Sym_{ij}$ element in $\text{Type}_t$) is a $\text{SymGroup}_{d_{t,i}}$ element represented by $wcSym_{t,ij}$ as follows:
where $i = 1, 2, \ldots, numofDiseaseElements_t$; $j = 1, 2, \ldots, last\_i$.
Moreover, some downloaded $d_{t,i}$ documents contain both the $d_{t,i}$ symptom expressions and the symptom expressions of the $d_{t,i}$ complications. Therefore, if an element of the complicationTerm set is identified in the $d_{t,i}$ document, then all symptoms, $wcSym_{t,ij}$, right after the occurrence of the complicationTerm set element are excluded.
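A minimal Python sketch of this exclusion rule is given below. It assumes one possible reading of “right after”, namely that every symptom from the first complicationTerm element to the end of the document is dropped; the function name, the `(kind, value)` input layout, and the sample document are hypothetical.

```python
# English glosses of the complicationTerm set elements listed in the text.
COMPLICATION_TERMS = {"complications", "unTreat"}

def exclude_complication_symptoms(edu_items):
    """Keep only the symptoms that occur before the first complication term.

    edu_items -- ordered list of (kind, value) pairs for one disease document,
                 where kind is "sym" for a wcSym feature and "term" for any
                 other identified term (hypothetical layout for this sketch).
    """
    kept = []
    for kind, value in edu_items:
        if kind == "term" and value in COMPLICATION_TERMS:
            break  # assumption: everything after this point describes complications
        if kind == "sym":
            kept.append(value)
    return kept

document = [
    ("sym", "pain(person,muscle)"),
    ("sym", "beTired(person)"),
    ("term", "complications"),
    ("sym", "swell(person,head)"),
]
basic = exclude_complication_symptoms(document)
print(basic)
```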
The positive/negative instances of the $d_{t,i}$ $wcSym_{t,ij}$ pairs from each $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pair are formed from the result of the Cartesian product DS × SG as follows: each $d_{t,i}$ $wcSym_{t,ij}$ pair in a given $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pair is a positive instance, having CErel (the positive class), if $d_{t,i}$ and $\text{SymGroup}_{d_{t,i}}$ have the same $\text{Type}_t$. Meanwhile, each $d_{t,i}$ $wcSym_{t,ij}$ pair in a given $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pair is a negative instance, having nonCErel (the negative class) based on $\text{Type}_t$ of the $d_{t,i}$ document, if $d_{t,i}$ and $\text{SymGroup}_{d_{t,i}}$ have different $\text{Type}_t$ (see Figure 4).
where $D_t$ is a disease-name set in $\text{Type}_t$; $t = 1, 2$;
where $i = 1, 2, \ldots, numofDiseaseElements_t$;
where $\text{SymGroup}_{d_{t,i}} = \{wcSym_{t,i1}, wcSym_{t,i2}, \ldots, wcSym_{t,i\,last\_i}\}_{of\_d_{t,i}}$.
After automatic supervised learning with NB, SVM, and LR for each disease type on the learning corpus, we determine the $d_{t,i}$ $wcSym_{t,ij}$ pairs having CErel in the $d_{t,i}$ documents of each disease type from the test corpus. The $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pair with CErel is then determined by grouping all determined $d_{t,i}$ $wcSym_{t,ij}$ pairs having CErel with the same $d_{t,i}$ from the $d_{t,i}$ document. All $d_{t,i}$-$\text{SymGroup}_{d_{t,i}}$ pairs having CErel are then used for constructing the DSKG.
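The automatic labeling step can be sketched in Python as shown below. It encodes one possible interpretation of the description above and should not be taken as the exact procedure of Figure 4: a disease paired with a symptom of its own SymGroup is labeled CErel, a disease paired with a symptom from a SymGroup of the other disease type is labeled nonCErel, and pairs across different diseases of the same type are left unlabeled. The toy diseases, types, and symptom features are hypothetical.

```python
from itertools import product

# Hypothetical toy corpus: Type 1 = infectious, Type 2 = non-infectious.
disease_type = {"Flu": 1, "Cold": 1, "Diabetes": 2}
sym_group = {
    "Flu": ["beFever(person)", "cough(person)"],
    "Cold": ["cough(person)", "sneeze(person)"],
    "Diabetes": ["beThirsty(person)", "urinate(person,often)"],
}

def label_instances(disease_type, sym_group):
    """Label (disease, wcSym) pairs from the Cartesian product DS x SG.

    Assumptions of this sketch: a pair from a disease's own SymGroup is
    positive; a pair with a symptom from a SymGroup of a different type is
    negative unless the disease also has that symptom itself; other pairs
    are skipped.
    """
    labels = {}
    for disease, owner in product(sym_group, repeat=2):
        for sym in sym_group[owner]:
            if owner == disease:
                labels[(disease, sym)] = "CErel"
            elif (disease_type[owner] != disease_type[disease]
                  and sym not in sym_group[disease]):
                labels[(disease, sym)] = "nonCErel"
    return [(d, s, label) for (d, s), label in labels.items()]

instances = label_instances(disease_type, sym_group)
for row in instances:
    print(row)
```

The labeled triples could then be vectorized and passed to any of the three classifiers named in the text; that step is omitted here because it depends on the feature encoding described later in the paper.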
3.3. How to Construct DSKG with the Problem of High Dimensionality of Symptom-Concept Features
After determining the dt,i-SymGroupdt,i pairs having CErel from all downloaded dt,i documents, the union of all SymGroupdt,i contains many different sc features, which form a high dimensional feature space S (the high dimensional symptom concepts) from which the DSKG is built over several different disease-name concepts (dt,i). The high dimensional symptom concepts make it difficult to find the relevant disease symptom concepts for constructing a concise DSKG. Therefore, we propose using PCA to solve the high dimensionality problem. Following the description of PCA in [27], a variance and a covariance are defined as follows:
The variance of a random variable X is defined as
$$\mathrm{var}(X) = E\{(X - E\{X\})^2\} \tag{6}$$
where (6) shows how variance measures the average squared deviation from the mean value. When we have more than one random variable, it is useful to analyze the covariance:
$$\mathrm{cov}(X_a, X_b) = E\{(X_a - E\{X_a\})(X_b - E\{X_b\})\} \tag{7}$$
If the covariance is zero, which is equivalent to saying that the correlation coefficient is zero, the variables are said to be uncorrelated. The variances and covariances of the elements of a random vector x are often collected into a covariance matrix C(x) whose (a, b)-th element is simply the covariance of Xa and Xb:
$$C(\mathbf{x})_{ab} = \mathrm{cov}(X_a, X_b) \tag{8}$$
The diagonal of the covariance matrix gives the variances. The covariance matrix is basically a generalization of variance to random vectors. In our research, the Xa and Xb features are the sa and sb features within the m × m symptom-concept feature matrix (where m is the number of different symptom-concept features; m > 100). Therefore, 〈s1, s2, …, sm〉 is a symptom-concept feature vector in which sc ∈ S and c = 1, 2, ..., m. The symptom-concept feature vector of the symptom-concept feature matrix is then rotated to group its symptom-concept features into the minimum number of separate feature groups, where each separate feature group is called “Fgroupz”; z = 1, 2, ..., numofFeatureGroups, and numofFeatureGroups is less than m. After the rotation, a feature loading weight from an eigenvector is determined for each sc feature with respect to Fgroupz. A high feature loading weight of sc on Fgroupz implies that the correlation between sc and Fgroupz is high. The different sc feature elements with high feature loading weights in a certain Fgroupz are wrapped into a factor (called “Factorz”) with its factor score (called “FactorScorez”), which is determined by Equation (9) from the feature loading weights of the wrapped sc feature elements.
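$$\mathrm{FactorScore}_z = \sum_{c=1}^{n_z} w_{z,c}\,\frac{x_c - \bar{x}_c}{\sigma_c} \tag{9}$$
where wz,c is a feature loading weight of sc from an eigenvector in Fgroupz;
sc is a symptom-concept feature element within Fgroupz;
c = 1, 2, …, nz;
nz is the number of different symptom-concept features in Fgroupz;
xc is an original value of the number of each sc, with its mean, x̄c, and standard deviation, σc.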
Therefore, the symptom-concept feature space is reduced from m to numofFeatureGroups dimensions, which makes it less time consuming to find the relevant sc feature elements as the common symptom concepts among the di occurrences for constructing the DSKG (see Section 4.5).
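The quantities used in this section can be sketched in a few lines of plain Python. The sketch below computes a covariance matrix in the sense of the variance and covariance definitions above (using the population form, i.e., dividing by the number of observations), and a factor score under the assumption that it is a loading-weighted sum of standardized feature values, as the definitions accompanying Equation (9) suggest. It does not perform the eigen-decomposition or rotation themselves; the loadings are taken as given, and the function names are hypothetical.

```python
def covariance_matrix(rows):
    """Population covariance matrix C with C[a][b] = cov(X_a, X_b).

    rows: list of observations, each a list of m feature values.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[a] for row in rows) / n for a in range(m)]
    return [
        [sum((row[a] - means[a]) * (row[b] - means[b]) for row in rows) / n
         for b in range(m)]
        for a in range(m)
    ]


def factor_score(loadings, values, means, stds):
    """Loading-weighted sum of standardized feature values (assumed form)."""
    return sum(w * (x - mu) / sd
               for w, x, mu, sd in zip(loadings, values, means, stds))
```

The diagonal of the returned matrix holds the variances, so `covariance_matrix(rows)[a][a]` is the variance of feature a.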
4. System Overview
There are five steps in our framework: Corpus Preparation; Determination of dt,i and wcSymt,ij Features; Automatic Supervised Learning of dt,i wcSymt,ij Pairs Having CErel; Determination of dt,i wcSymt,ij Pairs Having CErel for Collecting dt,i-SymGroupdt,i Pairs with CErel; and DSKG Construction, as shown in Figure 5.
4.1. Corpus Preparation
This step prepares two disease-symptom corpora from downloaded disease documents, whose topic names are disease names, on two medical web-board resources: the disease-symptom corpus downloaded from the Thai-Healthcare-Knowledge web-board resource is used as the learning corpus, and the one downloaded from the Thai-hospital web-board resource is used as the test corpus. Each disease-symptom corpus covers the same disease-name concepts from 70 different disease-name documents (the 70 di documents) on each medical web-board resource. The documents, which are associated with various infections, strokes, kidney diseases, diabetes, and cardiovascular diseases, are randomly selected from about 700 and 400 different disease-name documents of the Thai-Healthcare-Knowledge web-board resource and the Thai-hospital web-board resource, respectively. The selection of these diseases is motivated by the rapidly increasing number of patient cases in Thailand, e.g., diabetes [28]. This step applies Thai word segmentation tools [29], including named entity recognition [30,31], to each disease-symptom corpus. After word segmentation, EDU segmentation is performed [32,33]. The learning corpus and the test corpus then contain 12,000 EDUs and 10,000 EDUs, respectively. For each medical web-board resource, the disease-symptom corpus consists of 33 different disease names in Type 1 (33 d1,i documents of the infectious disease type) and 37 different disease names in Type 2 (37 d2,i documents of the non-infectious disease type). The sample size obtained by random sampling for evaluating the symptom concept determination on the test corpus is about 35 different di documents (covering both disease types), according to Equation (10) [34].
where
.
Moreover, all instances resulting from the Cartesian product of DS × SG over the correct symptom-concept determination and the correct disease-name recognition [30,31] on the learning corpus are used for the automatic supervised learning of the dt,i wcSymt,ij pairs having CErel within each disease type, Type t (Type 1 and Type 2), based on tenfold cross validation (see Section 4.3). The learning results are used to determine the dt,i wcSymt,ij pairs having CErel from the test corpus, and the correct dt,i wcSymt,ij pairs having CErel are collected into the dt,i-SymGroupdt,i pair with CErel that has the same dt,i for the DSKG construction.
4.2. Determination of dt,i and wcSymt,ij Features
The objective of this step is to determine the dt,i and wcSymt,ij features from the learning corpus and the test corpus of each disease type (Type 1, Type 2) for the Automatic Supervised Learning of dt,i wcSymt,ij Pairs Having CErel in Section 4.3 and the Determination of dt,i wcSymt,ij Pairs Having CErel for Collecting dt,i-SymGroupdt,i Pairs with CErel in Section 4.4, respectively.
i. Determination of dt,i Features
The disease name feature, dt,i, of the dt,i document having di as the topic name in Type t is determined by named-entity recognition [30,31] in each disease type of both the learning corpus and the test corpus from the previous “Corpus Preparation” step (Section 4.1), followed by named-entity translation from Thai to English using the Lexitron dictionary. The determined dt,i features from the learning corpus are then collected into the Dt set for the automatic supervised learning in Section 4.3.
ii. Determination of wcSymt,ij Features
With respect to the wc pattern, it is necessary to prepare and collect the Verbstrong, Verbweak, Noun, Adj, and Adv sets of Figure 2 before determining the wcSymt,ij features, as follows:
● Preparation and collection of the Verbstrong, Verbweak, Noun, Adj, and Adv sets: Each element concept in the Verbstrong, Verbweak, Noun, Adj, and Adv sets in Figure 2 is prepared and collected from terms on the medical-symptom-expression lists of the Wikipedia and MeSH web sites after translating these terms from English to Thai with the Lexitron dictionary, as shown in Table 1.
From Table 1, some translated terms of the noun expressions on the medical-symptom-expression list (from English to Thai by the Lexitron dictionary) are presented as EDU or noun-phrase expressions in Thai, for example:
Example 2. “arrhythmia” = “หัวใจเต้นผิดจังหวะ/H̄ạwcı tên p̄hid cạngh̄wa” is a Thai sentence/EDU expression, which is presented below with parts of speech after applying the word and EDU segmentation tools [29,32,33]:
EDU: ((หัวใจ/H̄ạwci)/Noun)NP1 ((เต้น/tên)/Verbstrong (ผิดจังหวะ/p̄hid cạngh̄wa)/Adv)/VP
These segmented words of the EDU are then translated from Thai to English with the Lexitron dictionary and WordNet to collect the element concepts in the Verbstrong, Noun, and Adv sets (in Figure 2), as shown in the following:
((หัวใจ/H̄ạwci)/heart)/Noun, ((เต้น/tên)/pulse)/Verbstrong, and ((ผิดจังหวะ/p̄hid cạngh̄wa)/irregularly)/Adv.
Example 3. “palpitation” = “อาการใจสั่น/Xākār Cı s̄ạ̀n” is a Thai noun-phrase expression, which is presented below with parts of speech after applying the word segmentation tool [29]:
((อาการ/Xākār)/Noun (ใจ/Cı)/Noun (สั่น/s̄ạ̀n)/Verbstrong)/NP
The results of translating this NP’s terms from Thai to English with the Lexitron dictionary and WordNet to collect the element concepts in the Noun and Verbstrong sets (in Figure 2) are ((อาการ/Xākār)/symptom)/Noun, ((ใจ/Cı)/heart)/Noun, and ((สั่น/s̄ạ̀n)/shake)/Verbstrong.
According to the Thai to English translation of Example 2 and Example 3, if the segmented Thai word has several English concepts, the expert will select the corresponding English concept for collecting the element concepts.
● Determination of wcSymt,ij features from each dt,i document: After stemming words and eliminating stop words in either the learning corpus or the test corpus of each disease type, EDUh,Symt,ij of the dt,i document is identified by the predicate-verb term (va ∈ Verbstrong ∪ Vinf). The wc expression of EDUh,Symt,ij is then obtained from V, W1, and W2 of the wc pattern (see Section 3.1) together with the following general rules (R1, R2):
R1: w1,g is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis (where w1,g ∈ W1).
R2: if ((vweak,b + winf,c) ∨ (vstrong,f + w2,e)) ∈ Symptom-Cue, where Symptom-Cue = {‘มี/have + อาการ/symptom’, ‘เป็น/be + อาการ/symptom’, ‘เกิด/occur + อาการ/symptom’}, then we take the next two words right after “vweak,b + winf,c” or “vstrong,f + w2,e” to be the new va + w2,e expression as the symptom expression instead, where vweak,b ∈ Verbweak; winf,c ∈ Noun ∪ Adj ∪ Verbstrong; vstrong,f ∈ Verbstrong; w2,e ∈ W2; va ∈ Verbstrong ∪ Vinf; (vweak,b + winf,c) ∈ Vinf; and a, b, c, e, and f are integer element indices.
The wcSymt,ij features of the wc expressions are automatically determined by the concept rules R3, R4, and R5 and are represented by the predicate-argument pattern after the wc expressions of the disease-symptom documents are translated from Thai to English using the collected element concepts in the Verbstrong, Verbweak, Noun, Adj, and Adv sets.
R3: if (w1,g ∈ Person) ∧ (Person = {‘คนไข้, ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, ……}), then the w1,g concept is “person”.
R4: if w2,e ∈ Symptom-Expression-Level, then w2,e has a null value for a general concept (where Symptom-Expression-Level = {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}).
R5: if (vweak,b = ‘รู้สึก/feel’) ∧ (winf,c= vstrong, f) ∧ (vstrong, f ∈ Verbstrong), then (vweak,b + winf,c)= vstrong, f.
For example, the concept of (รู้สึก/feel)/Verbweak + (ปวด/pain)/Verbstrong is equivalent to “pain”, as shown in the following EDU with the wcSymt,ij feature pain(person, stomach).
EDU: “คนไข้รู้สึกปวดกระเพาะอาหาร/The patient feels pain in the stomach”.
(คนไข้/Khnk̄hị̂)/patient)/NP1 ((รู้สึก/rū̂s̄ụk)/feel)/Verbweak (ปวด/pwd)/pain)/Verbstrong (กระเพาะอาหาร/krapheāa xāh̄ār)/stomach)/Noun)/VP.
Regarding complications, if an element of the complicationTerm set is identified, all wcSymt,ij features occurring after that complicationTerm element are excluded.
Therefore, all wcSymt,ij features of the wc expressions from each dt,i document of the learning corpus are determined and grouped into the corresponding symptom-concept group of the dt,i feature (SymGroupdt,i) for the automatic supervised learning of dt,i wcSymt,ij pairs having CErel in Section 4.3.
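The concept rules R3, R4, and R5 can be illustrated with a small sketch that maps already translated words to a predicate-argument feature such as pain(person, stomach). It is an illustration only: the word sets are partial, `VERB_STRONG` and the function `wc_concept` are hypothetical, and the sketch assumes the Thai to English translation and rules R1 and R2 have already been applied.

```python
# Partial, illustrative word sets (English glosses only).
PERSON = {"patient", "elder", "infant", "child", "teenager"}   # R3
LEVEL = {"very", "little", "moderately"}                       # R4
VERB_STRONG = {"pain", "shake", "pulse", "cough"}              # hypothetical


def wc_concept(w1, verb, w_inf=None, w2=None):
    """Build a predicate-argument symptom concept from translated words.

    w1: head noun of NP1; verb: main verb; w_inf: word after a weak verb;
    w2: argument word following the predicate.
    """
    agent = "person" if w1 in PERSON else w1       # R3: generalize the agent
    if verb == "feel" and w_inf in VERB_STRONG:    # R5: feel + strong verb
        verb = w_inf
    if w2 in LEVEL:                                # R4: drop degree words
        w2 = None
    args = [agent] + ([w2] if w2 else [])
    return f"{verb}({', '.join(args)})"
```

For the EDU “The patient feels pain in the stomach”, `wc_concept("patient", "feel", "pain", "stomach")` yields the feature pain(person, stomach) given in the example above.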
4.3. Automatic-Supervised-Learning of dt,i wcSymt,ij Pairs Having CErel
Each dt,i feature and the SymGroupdt,i elements, i.e., the wcSymt,ij features determined in the previous step, are used in this automatic supervised learning step, where Type t with t = 1 is the infectious disease type and Type t with t = 2 is the non-infectious disease type. The learning corpus of Type t, containing the instances of dt,i wcSymt,ij pairs that result from the Cartesian product of DS × SG, is used for learning the dt,i wcSymt,ij pairs having CErel by NB, SVM, and LR for each Type t. A positive instance (CErel class) of a dt,i wcSymt,ij pair is formed if Type t of Dt and Type t of SymGrpt are the same; otherwise, a negative instance (nonCErel class) is formed according to Type t of Dt (see Section 3.2).
After the Cartesian product of DS × SG is computed on the learning corpus with the 70 di documents, the positive and negative instances of each Type t are selected from the Cartesian product results by random sampling, with an approximately equal number of positive and negative instances covering all 70 di features. The sizes of the Type 1 and Type 2 learning samples are 1878 instances and 2125 instances, respectively, each containing both positive and negative instances.
NB learning [24]: the feature sets, Dt and SymGrpt, occur in the dt,i wcSymt,ij pairs of the positive/negative instances with the CE-rel/nonCE-rel class, respectively, which are formed for the automatic supervised learning on each disease-type learning sample. The learning results of this step, obtained using Weka [35], are the feature probabilities of dt,i and wcSymt,ij in Type t, where each wcSymt,ij feature is represented by a symptom-concept code (see Table 2).
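SVM learning [23] with the linear kernel: The linear function, f(x), of the input x = (x1, …, xn), which is assigned to the positive class if f(x) ≥ 0 and otherwise to the negative class (f(x) < 0), can be written as follows:
$$f(\mathbf{x}) = \langle \mathbf{w} \cdot \mathbf{x} \rangle + b = \sum_{k=1}^{n} w_k x_k + b \tag{11}$$
where w = (w1, …, wn) is the weight vector and b is the bias.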
With regard to each disease-type learning sample, the SVM learning determines wk and b for dt,i and wcSymt,ij features (xk) in each disease-symptom pair (dt,i wcSymt,ij) with either the positive class (CE-rel) or the negative class (nonCE-rel) formed by the automatic supervised learning on each disease-type learning sample.
LR [25]: The logistic regression model of this research is based on linear logistic regression with binary vector data. Usually, input data with any value are used to establish which attributes are influential in predicting the given outcome, whose value lies between 0 and 1 and hence can be interpreted as a probability. The logistic function can be written as:
$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \tag{12}$$
F(x) is interpreted as the probability of the given outcome to be predicted, where x1 and x2 are attribute variables, and β0, β1, and β2 are the model estimators which play the role of momentum for each attribute. The LR learning determines β0, β1, and β2 for dt,i and wcSymt,ij as the x1 and x2 features, respectively, in each disease-symptom pair (dt,i, wcSymt,ij) with either the positive/CErel class or the negative/nonCErel class formed for the automatic supervised learning on each disease-type learning sample.
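The two decision functions can be sketched as follows. The sketch assumes the categorical dt,i and wcSymt,ij features have already been encoded numerically (the encoding is not specified here, so any values passed in are illustrative), and it uses the 0.5 probability cut-off that the paper applies to the LR output in Section 4.4. The function names are hypothetical, and the learned parameters would come from a learner such as Weka rather than from this code.

```python
import math


def svm_decision(weights, bias, x):
    """Linear decision value f(x) = sum_k w_k * x_k + b."""
    return sum(w * xk for w, xk in zip(weights, x)) + bias


def svm_class(weights, bias, x):
    """Positive (CErel) class when f(x) >= 0, otherwise nonCErel."""
    return "CErel" if svm_decision(weights, bias, x) >= 0 else "nonCErel"


def logistic(beta0, beta1, beta2, x1, x2):
    """Logistic function F(x) for two attribute variables."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x1 + beta2 * x2)))


def lr_class(beta0, beta1, beta2, x1, x2, threshold=0.5):
    """CErel when the predicted probability reaches the threshold."""
    return "CErel" if logistic(beta0, beta1, beta2, x1, x2) >= threshold else "nonCErel"
```

Both classifiers reduce to a sign test on a linear score: F(x) ≥ 0.5 holds exactly when β0 + β1x1 + β2x2 ≥ 0.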
4.4. Determination of dt,i wcSymt,ij Pairs Having CErel for Collecting dt,i-SymGroupdt,i Pairs with CErel
There are three steps in the CErel determination from the test corpus consisting of 70 di documents: dt,i and wcSymt,ij Determination; Determination of dt,i wcSymt,ij Pairs Having CErel; and Collection of dt,i-SymGroupdt,i Pairs Having CErel.
i. dt,i and wcSymt,ij Determination
The disease name concept, dt,i, is determined from the dt,i document having di as the topic name in Type t by named-entity recognition [30,31], followed by named-entity translation from Thai to English with the Lexitron dictionary, in each disease type of the test corpus. After stemming words and eliminating stop words in the test corpus of each disease type, the EDUh,Symt,ij occurrence is identified by matching the predicate-verb term set (V; V = Verbstrong ∪ Vinf) against each EDUh occurrence in the dt,i document. According to R1–R5, each wcSymt,ij feature of the wc expression in the identified EDUh,Symt,ij occurrence is then determined by translating all terms in the predicate-argument pattern from Thai to English using the collected element concepts in the Verbstrong, Verbweak, Noun, Adj, and Adv sets. Regarding complications, if an element of the complicationTerm set is identified, all wcSymt,ij features occurring after that complicationTerm element are excluded.
ii. Determination of dt,i wcSymt,ij Pairs Having CErel
The objective of this step is to determine the dt,i wcSymt,ij pairs having the CErel class by NB, SVM, and LR for each disease type of the test corpus.
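NB: The cause-effect relation between the dt,i features and the wcSymt,ij features in the test corpus of each disease type is determined by Equation (13) using the probabilities of dt,i and wcSymt,ij in Table 2.
$$DSympPairClass = \underset{class \in \{CErel,\, nonCErel\}}{\arg\max}\; P(class \mid d_{t,i}, wcSym_{t,ij}) = \underset{class \in \{CErel,\, nonCErel\}}{\arg\max}\; P(d_{t,i} \mid class)\, P(wcSym_{t,ij} \mid class)\, P(class) \tag{13}$$
where DSympPairClass is a relation class between a disease-name concept and a symptom concept on a disease-symptom pair;
dt,i ∈ Dt, on which Dt is a disease name set in Type t;
t = 1, 2;
i = 1, 2, …, numofDiseaseElementst;
wcSymt,ij is a wc concept, particularly a symptom concept of a wc expression in the dt,i document;
j = 1, 2, …, last_i.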
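A naive Bayes decision of this kind can be sketched as an arg-max over the two classes of the product of the class prior and the per-class feature probabilities. The probability tables below are placeholders rather than values from Table 2, the function name `nb_class` is hypothetical, and the small fallback probability for unseen features is an assumption added so that the product never collapses to zero.

```python
def nb_class(disease, symptom, prior, p_disease, p_symptom, unseen=1e-6):
    """Return the class maximizing P(class) * P(d | class) * P(sym | class).

    prior:     {class: P(class)}
    p_disease: {class: {disease: P(disease | class)}}
    p_symptom: {class: {symptom: P(symptom | class)}}
    unseen:    fallback probability for features absent from a table (assumed)
    """
    scores = {
        c: prior[c]
        * p_disease[c].get(disease, unseen)
        * p_symptom[c].get(symptom, unseen)
        for c in prior
    }
    return max(scores, key=scores.get)
```

With a prior of 0.5 per class, a pair is labeled CErel whenever the product of its two conditional probabilities is larger under the CErel class than under the nonCErel class.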
If DSympPairClass of Equation (13) is the CE-rel/CErel class, the dt,i wcSymt,ij pair is collected into DSPt (a list of disease-symptom pairs having CErel in Type t), as shown in the algorithm for determining disease-symptom pairs having CErel (Algorithm 1).
Algorithm 1 DeterminationOfDiseaseSymptomPairsHavingCErel Algorithm.
Assume that each EDU is represented by (NP VP) after word stemming and stop word removal; Ldti is a list of EDUs in the dt,i document; dt,i is a disease name of Type t (t = {1, 2}), i = 1, 2, ..., numofDiseaseElementst; DSPt is a list of disease-symptom pairs with CErel in Type t; DNamet is a disease name set in Type t.
DETERMINATION_OF_DISEASE_SYMPTOM_PAIRS_HAVING_CEREL
DName1 ← ∅; DName2 ← ∅;
ArrayList<String>[] DSP = new ArrayList[2]; String[][] d = new String[2][40]; /* DSPt (t = {1, 2}) contains two ArrayList data structures for Type 1 and Type 2 */
Set<String> complicationTermSet = new HashSet<String>(); /* complicationTermSet is a set of complication terms */
complicationTermSet.add(“ภาวะแทรกซ้อน/complications”);
complicationTermSet.add(“ไม่รักษา/unTreat”); /* add more elements into complicationTermSet */
Count = complicationTermSet.size(); /* the number of complicationTermSet elements */
String[] CTS = complicationTermSet.toArray(new String[complicationTermSet.size()]); /* convert complicationTermSet from a set structure to an array structure */
For (t = 1 to 2; t++) {
    If t = 1 then n = 33 else n = 37;
    For (i = 1 to n; i++) {
        dt,i = getDiseaseNameConceptFromDocumentTopicName; /* get a disease name */
        DNamet = DNamet ∪ dt,i; h = 1; j = 0; complication = 0; /* each disease name element is collected into DNamet */
        while h ≤ length[Ldti] ∧ complication = 0 do {
            For (k = 1 to Count; k++) /* check complications */
                If EDUh.contains(CTSk) then complication = 1;
            vh = EDUh.VP.verb; wrdh = EDUh.VP.word; /* verb is the main verb of EDUh (a verb of EDUh.VP); word is the word right after the main verb in EDUh.VP */
            If (complication = 0) ∧ ((vh ∈ Verbstrong) ∨ (vh + wrdh ∈ Vinf)) then { /* V is the predicate-verb term set; V = Verbstrong ∪ Vinf */
                If EDUh.headNounOfNP1 ∈ W1 ∧ EDUh.firstWordOfNP2 ∈ W2 then { /* W1 is an agent-term set; W2 is a linguistic-patient term set */
                    j++;
                    sym = wcSymt,ijDetermination; /* based on the wc expression of EDUh,Symt,ij by using R1–R5 */
                    switch (choice) {
                        Case 1: CErelDetermination(dt,i, sym) by Equation (13); break; /* NB */
                        Case 2: CErelDetermination(dt,i, sym) by Equation (11); break; /* SVM */
                        Case 3: CErelDetermination(dt,i, sym) by Equation (12); break; /* LR */
                    }
                    If (class = ‘Positive’) ∨ (class = ‘CE-rel’) then
                        DSPt.AddCauseEffectPairWithCErel(dt,i + “-” + sym);
                }
            }
            h++
        }
    }
}
Return DNamet, DSPt
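The control flow of Algorithm 1 can be sketched in runnable form as follows. This is an illustrative rendering, not the authors' implementation: the EDU representation (a dictionary with the keys `text`, `np1_head`, `verb`, `next_word`, and `np2_first`), the function name `collect_cerel_pairs`, and the callables `classify` and `to_concept` (standing in for the chosen NB, SVM, or LR decision and for the R1–R5 concept determination) are all assumptions.

```python
def collect_cerel_pairs(documents, classify, complication_terms,
                        verb_strong, v_inf, w1, w2, to_concept):
    """Collect disease-symptom pairs classified as CErel, per disease type.

    documents: {type_id: {disease_name: [edu, ...]}}, where each edu is a dict
               with keys 'text', 'np1_head', 'verb', 'next_word', 'np2_first'.
    classify:  callable (disease, symptom_concept) -> "CErel" or "nonCErel".
    to_concept: callable edu -> symptom concept string.
    Returns (dname, dsp): disease names and CErel pairs, keyed by type.
    """
    dname = {t: set() for t in documents}
    dsp = {t: [] for t in documents}
    for t, docs in documents.items():
        for disease, edus in docs.items():
            dname[t].add(disease)
            for edu in edus:
                # Stop at the first EDU containing a complication cue, so the
                # symptoms of complications are never collected.
                if any(term in edu["text"] for term in complication_terms):
                    break
                is_symptom_verb = (edu["verb"] in verb_strong
                                   or (edu["verb"], edu["next_word"]) in v_inf)
                if not is_symptom_verb:
                    continue
                if edu["np1_head"] in w1 and edu["np2_first"] in w2:
                    sym = to_concept(edu)
                    if classify(disease, sym) == "CErel":
                        dsp[t].append((disease, sym))
    return dname, dsp
```

Passing the classifier as a callable mirrors the switch over NB, SVM, and LR in Algorithm 1 without fixing any one of them.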
SVM: The cause-effect relation between dt,i and wcSymt,ij of a dt,i wcSymt,ij pair from the test corpus of each disease type is determined by the weight vector over all dt,i and wcSymt,ij features. The weight vector and the bias obtained from the SVM learning using Weka [35] are used to determine the dt,i wcSymt,ij pairs with CE-rel by Equation (11). If f(x) ≥ 0, the dt,i wcSymt,ij pair has CE-rel/CErel as the positive class; otherwise, it has nonCE-rel/nonCErel as the negative class. The dt,i wcSymt,ij pair with the positive class is collected into DSPt, as shown in the algorithm for determining disease-symptom pairs having CErel (Algorithm 1).
LR: The research applies Equation (12) to determine the DSympPair class which is a relation class, i.e., a CE-rel/nonCE-rel class, between a disease-name concept (dt,i) and a symptom concept (wcSymt,ij) on a dt,i wcSymt,ij pair from the test corpus of each disease type, whilst F(x) is interpreted as the probability of either “positive/CErel” as the CE-rel class or “negative/nonCErel” as the nonCE-rel class by the following rules.
Rule 1 (CE-rel_Class): If F(x)CE-rel_Class ≥ 0.5, then the dt,i wcSymt,ij pair has CErel between dt,i and wcSymt,ij.
Rule 2 (nonCE-rel_Class): If F(x)nonCE-rel_Class ≥ 0.5, then the dt,i wcSymt,ij pair has nonCErel between dt,i and wcSymt,ij.
According to Equation (14), x1 and x2 are the attribute variable pair of each dt,i wcSymt,ij pair from the test corpus of each disease type, where β0, β1, and β2 of dt,i and wcSymt,ij are obtained by the automatic supervised learning by LR on the learning sample of each disease type. The dt,i wcSymt,ij pair with the CE-rel class is collected into DSPt, as shown in the algorithm for determining disease-symptom pairs having CErel (Algorithm 1).
iii. Collection of dt,i-SymGroupdt,i Pairs Having CErel
All correctly determined dt,i wcSymt,ij pairs having CErel in DSPt from the previous step are grouped by the same dt,i into SymGroupdt,i, resulting in the dt,i-SymGroupdt,i pairs with CErel, as shown in Figure 6.
DNamet from the results of Algorithm 1 is Dt in Equation (3); Dt = {dt,1, dt,2, …, dt,numt}, where numt is numofDiseaseElementst in Type t.
According to Figure 6, all SymGroupdt,i of the dt,i-SymGroupdt,i pairs having CErel are collected into SymGrpt.
SymGrpt = {SymGroupdt,1, SymGroupdt,2, …, SymGroupdt,numt};
Dt and SymGrpt are used for the DSKG construction in the next step.
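The grouping of CErel pairs by disease name can be sketched as below. The pair representation (tuples of disease name and symptom concept) and the function name `group_by_disease` are hypothetical; dropping repeated symptom concepts within a group is an assumption made so that each SymGroup behaves like a set while keeping first-seen order.

```python
from collections import defaultdict


def group_by_disease(dsp_t):
    """Group (disease, symptom) pairs into {disease: [symptom, ...]}.

    Repeated symptom concepts for the same disease are kept only once.
    """
    groups = defaultdict(list)
    for disease, sym in dsp_t:
        if sym not in groups[disease]:
            groups[disease].append(sym)
    return dict(groups)
```

The keys of the returned dictionary correspond to Dt and its values to the members of SymGrpt.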
4.5. DSKG Construction
According to Dt and SymGrpt, the subsets of SymGrpt are united into Stypet by Equation (15). In addition, the Stypet sets of both disease types are united into S (the union of all SymGroupdt,i from both disease types) by Equation (16).
$$Stype_t = \bigcup_{i=1}^{num_t} SymGroup_{d_{t,i}} \tag{15}$$
where Stypet is a symptom-concept set of Type t (t = 1, 2);
numt is the cardinality of the Dt or SymGrpt set;
SymGroupdt,i is a symptom-concept group resulted by dt,i.
$$S = Stype_1 \cup Stype_2 = \{s_1, s_2, \ldots, s_m\} \tag{16}$$
where m is the cardinality of S.
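The two unions described for Equations (15) and (16) amount to plain set unions, as the following sketch shows. The input layout and the function name `union_symptoms` are hypothetical; sorting S is only a convenience to give the features a stable order.

```python
def union_symptoms(sym_grp_by_type):
    """Unite symptom groups per type, then across types.

    sym_grp_by_type: {type_id: {disease_name: [symptom_concept, ...]}}
    Returns (stype, s): per-type symptom sets and the sorted overall list S.
    """
    stype = {t: set().union(*groups.values())
             for t, groups in sym_grp_by_type.items()}
    s = sorted(set().union(*stype.values()))
    return stype, s
```

The length of the returned list `s` is the dimensionality m of the symptom-concept feature vector.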
S is then the symptom-concept feature vector of size m, where m is 143 in this research. PCA (using IBM SPSS Statistics for Windows, Version 21.0) is used to reduce the m features of the symptom-concept feature vector by wrapping each sc feature element whose feature loading weight from the eigenvector has an absolute value ≥ 0.4, based on our corpus, within the corresponding Fgroupz to become Factorz (z = 1, 2, ..., 39), as shown in Table 3.
In Table 3, the number of sc features is reduced from 143 to 39 Fgroupz groups. From Equation (16), we conclude Di as in Equation (17).
Therefore, each di (di ∈ Di) is paired with Factorz instead of SymGroupdi (see Table 4), where Factorz consists of the correlated sc feature elements, and FactorScorez is calculated for each disease name, di.
From Table 4, we then select, for each di, the Factorz occurrences having the highest FactorScorez as the common and relevant factors for constructing the DSKG (see Figure 7).
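The last two steps, wrapping features into factors by loading weight and linking each disease to its best-scoring factor, can be sketched as follows. The loadings and factor scores are assumed to be given (for example, exported from a statistics package), the function names are hypothetical, and selecting exactly one top factor per disease is a simplifying assumption, since more than one factor may be retained in practice.

```python
def factors_from_loadings(loadings, threshold=0.4):
    """Keep, per factor, the symptom concepts with |loading| >= threshold.

    loadings: {factor_id: {symptom_concept: loading_weight}}
    """
    return {z: [s for s, w in weights.items() if abs(w) >= threshold]
            for z, weights in loadings.items()}


def build_dskg(factor_scores, factors):
    """Link each disease to the symptom group of its highest-scoring factor.

    factor_scores: {disease_name: {factor_id: factor_score}}
    factors:       output of factors_from_loadings
    Returns {disease_name: [symptom_concept, ...]} as cause-effect edges.
    """
    graph = {}
    for disease, scores in factor_scores.items():
        best = max(scores, key=scores.get)
        graph[disease] = factors[best]
    return graph
```

Each entry of the returned dictionary represents one CErel connection from a disease-name node to a node holding a group of correlated symptom concepts.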
6. Conclusions
This research constructs a DSKG consisting of several CErel connections determined from the downloaded disease documents on the Thai-hospital and Thai-Healthcare-Knowledge web-boards, where each CErel connection links the causative-concept node (the <di> node represented by the di feature) to the correlated effect-concept node, i.e., the <sn1, sn2, …, snη> node represented by a group of correlated sc features as the common symptom-concept features. Moreover, all sc features of each di document in our research are based on the basic symptoms. Our proposed method of constructing the DSKG, which needs to reduce the high dimensional feature space of S (S = {sc}) for the graph construction, relies on the determination of the wcSymt,ij features representing the sc features and on the automatic supervised learning of dt,i wcSymt,ij pairs having CErel from the downloaded documents as unstructured data. PCA is then proposed for constructing the DSKG through dimensionality reduction of the symptom-concept feature space while minimizing information loss. In evaluating the proposed method, the conciseness and precision of the DSKG construction depend on the number of different symptom-concept features and the number of different disease-name concept features, whilst the accuracy of the CErel determination depends on the corpus behavior, e.g., the diversity and the frequencies of the wcSymt,ij feature occurrences. In addition, the accuracy of the wcSymt,ij feature determination depends on the number of symptoms with long-tailed explanations. In contrast to previous works on cause-effect/disease-symptom relation determination and cause-effect/disease-symptom knowledge graph/network construction from documents: (1) our determination of the symptom-concept feature, wcSymt,ij, as a composite variable (which relies on the predicate-argument pattern from NP1 and VP through the preparation and collection of the Verbstrong, Verbweak, Noun, Adj, and Adv sets from the medical-symptom-expression lists on Wikipedia and MeSH, without concept annotation) has a high F1-score, whereas the symptom-concept determination of the previous works is based only on either noun-phrase [12,13,14,15] or verb-phrase [16] concepts obtained by either expert annotation or automatic string matching to ICD-9 codes and UMLS concepts; (2) our determination of dt,i wcSymt,ij pairs having CErel is based on the Cartesian product of DS × SG for the automatic supervised learning, in which each di feature from a noun/noun-phrase expression is paired with each group of wcSymt,ij features from the wc expressions of several EDUs with/without complications, whereas the previous works determine the cause-effect/disease-symptom relation on a noun/noun-phrase pair with either supervised learning by experts [11] or automatic determination relying on ICD-9 codes [13], without considering that some dt,i documents contain both the basic symptoms and the complications; (3) our DSKG construction, which uses PCA to reduce the high dimensional symptom-concept features, presents the DSKG with precision and high conciseness, whereas in the previous works on cause-effect/disease-symptom graph/network construction from texts, each CErel connection links the causative-concept node (represented by one causative-concept feature) to the corresponding effect-concept node (represented by one effect-concept feature) without considering the high dimensional effect features, particularly the symptom features [7,13,15]. Moreover, the DSKG results, e.g., in Figure 7, were frequently found to be in alignment with scientific findings and with the objective of this research. For example, following the DSKG in Figure 7, vascular diseases were found to be associated with several diseases, including kidney disease and myocardial infarction, which is also reported in the literature (e.g., [38,39,40]). In the future, the temporal feature and the condition feature should be considered to increase the accuracy of the CErel determination, which would in turn increase the precision of the DSKG for web-application development as in [41]. Moreover, the proposed method can also be applied to other languages, and the DSKG of our research enhances primary health care by supporting non-professional persons with a knowledge structure for preliminary diagnosis problems through a recommender system.