1. The Problem and the Method
This article investigates a natural question: do differences in medical recommendations arise from differences in the knowledge that different medical societies bring to the problem? To answer this question at scale, we need an automated method for measuring such differences. The purpose of this article is to present such a computational method and to evaluate its performance on a collection of case studies.
Our method uses the standard natural language processing approach of representing words and documents as embeddings, and combines it with a graph comparison algorithm. We evaluate the approach on three sets of medical guidelines: breast cancer screening, lower back pain management, and hypertension management.
The answer to this question matters because physicians with different specialties follow different guidelines, which results in undue variability of treatment. Understanding what drives the differences in recommendations should therefore contribute to reducing this variability, and to better patient outcomes [1,2,3].
1.1. Motivation
There are over twenty thousand clinical practice guidelines indexed by PubMed (https://pubmed.ncbi.nlm.nih.gov/ (accessed on 24 February 2020)), with over 1500 appearing every year [4]. Since clinical practice guidelines are developed by different medical associations, which rely on experts with different specialties and sub-specialties, disagreement between guidelines is likely. Indeed, as noted by [3], and discussed in [5,6], breast cancer screening guidelines contradict each other. Besides the breast cancer screening disagreements, which we model in this article, controversies over PSA screening, hypertension and other treatment and prevention guidelines are also well known.
Figure 1 illustrates our point. We see disagreements in seven breast cancer screening recommendations produced by seven different medical organizations. The hypothesis we investigate is that the contradictory recommendations reflect the specialized knowledge brought to bear on the problem by different societies.
Notice that the dominant view is to see expertise as a shared body of information, and experts as epistemic peers [7] with identical levels of competence. Under this paradigm of shared knowledge and inferential abilities, the medical bodies should not differ in their recommendations. That they do is interesting and worth investigating. Thus, this article is also motivated by the idea that the epistemology of disagreement [7,8,9] can be modeled computationally. On an abstract level, we view medical disagreement as “near-peer” disagreement [10,11,12], in which expert groups have partly overlapping knowledge. This article shows that such more realistic and fine-grained models can also be studied computationally, quantitatively, and at scale.
1.2. Brief Description of the Proposed Method
In this article we investigate the question of whether differences in medical recommendations come from differences in specialized medical knowledge applied to specific classes of patients, and whether such differences in specialties can be modeled computationally.
Our idea is to model “specialized medical knowledge”, which we cannot easily observe, by the differences in vocabulary used in medical guidelines. We then show that these vocabularies, assembled into vector representations of the documents, produce the differences in recommendations. We evaluate our method using three case studies: breast cancer screening, lower back pain management, and hypertension management guidelines. In the main track of this article, we use the breast cancer screening guidelines to present our approach and its evaluation; the additional evaluations on the other two sets of guidelines are presented in Appendix A.
More specifically, we computationally compare the full texts of guidelines with their recommendation summaries. For breast cancer screening, the summaries come from the CDC [3]; for lower back pain management, they come from a summary article [13]; and, for hypertension management, where we lack a tabular comparison, we used the abstracts of the documents. That is, we check whether the semantic similarities between the full documents follow the same pattern as the semantic similarities between the summaries. Note that each computational comparison was made between two sets of documents and not between individual documents.
This process involves several steps and is shown in Figure 2 for the breast cancer screening guidelines. Thus, the vector representations of the full texts of the guidelines model the vocabularies as bags of concepts, and therefore cannot model specific recommendations: the concepts in the recommendations, such as “mammography” and “recommend”, appear in all full texts, but specific societies may be either for mammography or against it. The vector representations of the recommendations model the differences in prescribed procedures, but not the vocabularies (see Table 1 and Table 2 below).
How do we know if vocabularies determine recommendations? We compute pairwise distances (cosine or word mover’s distance) between the full text vectors. In parallel, we compute pairwise distances between the recommendation vectors. We thus get two graphs, and their shapes can be compared. We show that the resulting geometries are very similar and could not have been produced by chance.
This process is slightly modified for lower back pain management, where we start with the tables of disagreement from the summary article [13]. For the hypertension management guidelines, we use a graph of summaries generated from the abstracts of the full documents, because we do not have any tabular sets of comparisons similar to [3,13]. Even with this change, the proposed method performs very well. Notice that to model full documents we use a large number of high-dimensional (200-dimensional), real-valued vectors. By contrast, the vectors representing the recommendations have only a small number of discrete-valued features (five for breast cancer screening, and 12, 59 and 71 for lower back pain management).
1.3. Summary of Contributions
The main contribution of this article is a relatively straightforward automated method of text analysis that (1) computes conceptual differences between documents addressing the same topic (for example, breast cancer screening), and (2) produces judgments that correlate highly with recommendations extracted from these documents by a panel of experts. We test the approach on the already mentioned breast cancer screening recommendations, as well as in additional sets of experiments on lower back pain management and hypertension management guidelines. These results open the possibility of large-scale analysis of medical guidelines using automated tools.
Another contribution is the articulation of a very natural graph clique-based algorithm/method for comparing the similarity of two collections of documents. Given two sets of documents, each of the same cardinality, and a mapping between nodes, we compute the percent of similarity (or, equivalently, distortion between the shapes of the two cliques), and the chances that the mapping arose from a random process.
We also document all steps of the process and provide the data and the code to facilitate both extensions of this work and its replication (the GitHub link is provided in Section 8).
1.4. Organization of the Article
In Section 2, we provide a brief overview of applications of natural language processing to the texts of medical guidelines, word embeddings, and some relevant work on disagreement. Afterwards, we follow the left-to-right order of Figure 2, using the breast cancer screening guidelines as the motivating example (the other experiments are described in Appendix A). Thus, Section 3 and Section 4 explain our example data sources: a CDC summary table of breast cancer screening guidelines and the corresponding full text documents. In these two sections, we also discuss the steps in the conceptual analysis of the table: first, the creation of a graph of conceptual distances between the columns of the table, and then the encoding of full documents as vectors, using two standard vectorization procedures. Our method of comparing summarized recommendations and full guideline documents is presented as three algorithms and discussed in Section 5.
After observing a roughly 70% similarity between the distances in the summaries and the distances in the full documents, we show in Section 6 that this similarity is not accidental. We conclude in Section 6 and Section 8 that this case study shows that NLP methods are capable of approximate conceptual analysis in this space (using Appendix A for additional support). This opens the possibility of deepening this exploration using more sophisticated tools such as relationship extraction, other graph models, and automated formal analysis (as discussed in Section 7 and Section 8).
In Appendix A, we provide information about additional experiments we performed to validate the proposed method (we placed this information in an appendix in order to simplify the main thread of the presentation). There, we first discuss a few variants of the main experiment, in which we filtered out some sentences from the full guideline texts. Then, we apply our method to two other collections of guidelines, namely the hypertension and lower back pain management guidelines. All of these experiments confirm the robustness of the proposed method and the system’s ability to computationally relate background knowledge to actual recommendations.
3. From Recommendations to Vectors of Differences and a Graph
We start with the simpler task of transforming the screening recommendations (referenced above in Figure 1) into vectors of differences, representing the disagreements in the recommendations, and then into a graph of their conceptual distances, where, intuitively, the larger the number of recommendation differences, the bigger the distance.
We proceed in three steps. First, using a diagram (Figure 3) and a table (Table 1), we make explicit the differences between the recommendations in Figure 1. Second, we transform the table into a count of differences (Table 2) and from that we derive distances between pairs of recommendations (Table 3). Third, we build the graph representing the recommendations, with nodes named after each organization (e.g., AAFP, ACOG, etc.) and edges labeled and drawn with distances (Figure 4).
3.1. Computing the Differences in Recommendations
Figure 3 is another representation of the information in the CDC comparison of the recommendations [3], presented earlier in Figure 1; it comes from [40] and clearly shows the differences between the guidelines. As we can see, there are two sides to the circle. The yellow side indicates the scenarios in which patients will likely decide when breast cancer screening should be done, and the purple side specifies the situations in which breast cancer guideline providers will most likely demand screening interventions. White radial lines indicate boundaries between the different societies. Red marks indicate that the physician decides; green marks indicate the patient’s decision.
3.2. From Differences to Distances and a Graph
Table 1 represents the content of this analysis as a collection of features. Table 2 encodes the differences in recommendations as numbers of differing features between pairs of recommendations. Then, Table 3 shows the distances between the guidelines derived from Table 1 and Table 2 using the Jaccard distance (the percentage of differing elements in two sets): given two recommendation summaries A and B, we take the number of differing feature values from Table 2 and divide it by five (the number of features). For example, for the pair (AAFP, ACR) the distance is the number of features on which the two recommendations differ, divided by five. All these distances were normalized to sum to 1 and are shown in Table 3 (we are not assuming that distances are always symmetric; in most cases they are, but later we will also report experiments with search distances, which are not symmetric). The normalization does not change the relative distances, and in the comparisons with the geometry of the full documents we only care about the relative distances.
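To make this computation concrete, the following sketch shows how such normalized Jaccard distances can be derived from a feature table in Python. The feature values below are illustrative placeholders, not the actual entries of Table 1.

```python
from itertools import combinations

# Illustrative feature encodings (five features per society); the real values
# come from the CDC comparison summarized in Table 1.
features = {
    "AAFP":   ["biennial", "50-74", "shared decision", "insufficient evidence", "no"],
    "ACR":    ["annual",   "40+",   "recommend",       "recommend",             "no"],
    "USPSTF": ["biennial", "50-74", "shared decision", "insufficient evidence", "no"],
}

def jaccard_distance(a, b):
    """Fraction of feature values on which two recommendation summaries differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Pairwise distances, normalized to sum to 1 (relative distances are unchanged).
raw = {(s, t): jaccard_distance(features[s], features[t])
       for s, t in combinations(features, 2)}
total = sum(raw.values())
normalized = {pair: d / total for pair, d in raw.items()}
print(normalized)
```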
Table 1, Table 2 and Table 3 represent the process of converting the information in Figure 3 into a set of distances. These distances are depicted graphically in Figure 4, where we display both the Jaccard distances between the recommendations and the numbers of differing features as per Table 2.
In the following section we will create a graph representation for the full documents (Figure 5b). We will present our graph comparison method in Section 5. In Section 6, we will assign numerical values to the distance between the two graphs, and show that this similarity cannot be the result of chance.
4. Transforming Full Guidelines Documents into Vectors and Graphs
In this article, we use both the CDC summaries ([3], reproduced and labeled in Figure 1 and Figure 3) and the full texts of the guidelines used by the CDC to create the summaries. The focus of this section is on the full guideline documents. Detailed information about these guidelines is shown in Table 4.
Note that in this section we are using the same acronyms (of medical societies) to refer to full guideline documents. This will not lead to confusion, as in this section we are only discussing full documents.
4.1. Data Preparation for All Experiments
From the breast cancer screening guidelines listed in the CDC summary document [3], the texts of the USPSTF, ACS, ACP and ACR guidelines were extracted from their HTML format. We used Adobe Acrobat Reader to obtain the texts from the PDF format of the AAFP, ACOG, and IARC guidelines. Since the AAFP documents also included preventive service recommendations for other diseases (such as other types of cancer), we added a preprocessing step to remove those recommendations, leaving only the parts matching “breast cancer”.
4.2. Measuring Distances between Full Documents
When creating an embedding representation of a text, we replace each word or term with its embedding vector. Thus, the full text of a guideline document is represented as a set of vectors. Our objective is to create a graph of conceptual distances between the documents.
The two most commonly used measures of distance, cosine distance and word mover’s distance, operate on different representations. The former operates on pairs of vectors, and the latter on sets of vectors. Thus, we need to create two types of representations.
Given a document, the first representation takes the average of all its word (term) embeddings. This creates a vector representing the guideline text. The second representation simply keeps the set of all its embedding vectors.
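A minimal sketch of the two representations, assuming a Gensim KeyedVectors model kv and a pre-tokenized document:

```python
import numpy as np

def average_vector(tokens, kv):
    """Single document vector: the mean of its word embeddings (used with cosine distance)."""
    vectors = [kv[t] for t in tokens if t in kv]
    return np.mean(vectors, axis=0)

def embedding_set(tokens, kv):
    """Set-of-vectors representation of a document (used with word mover's distance)."""
    return [kv[t] for t in tokens if t in kv]
```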
The cosine distance between two vectors v and w is defined as:

$$d_{\cos}(v, w) = 1 - \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert}.$$

We will also use a variant of this cosine distance to argue that the geometries we obtain in our experiments are similar irrespective of the distance measure (see Section 6).
The word mover’s distance (WMD, WM distance), introduced in [48], is a variant of the classic concept of “earth mover’s distance” from transportation theory [49]; sometimes the term “Wasserstein distance” is also used. The intuition encoded in this metric is as follows. Given two documents represented by their sets of vectors, each vector is viewed as a divisible object. We are allowed to “move” fractions of each vector in the first set to the other set. The WM distance is the minimal total distance accomplishing the transfer of all vector masses to the other set. More formally [48], the WM distance minimizes

$$\sum_{i,j} T_{ij}\, c(i,j) \quad \text{subject to} \quad \sum_{j} T_{ij} = d_i \;\; \text{and} \;\; \sum_{i} T_{ij} = d'_j,$$

where $T_{ij}$ is the fraction of word $i$ in document $d$ traveling to word $j$ in document $d'$; $c(i,j)$ denotes the cost of “traveling” from word $i$ in document $d$ to word $j$ in document $d'$; here the cost is the Euclidean distance between the two words in the embedding space. Finally, $d_i$ is the normalized frequency of word $i$ in document $d$ (and similarly $d'_j$ in document $d'$):

$$d_i = \frac{c_i}{\sum_{k} c_k},$$

where $c_i$ is the number of occurrences of word $i$ in document $d$.
We used the n_similarity and wmdistance functions from Gensim [50] as tools for generating vectors and calculating similarities/distances in our experiments.
4.3. Building Vector Representations of Full Documents
Just as there are multiple distance measures, there is more than one way to create word embeddings, and we experimented with several methods. We used three language models of medical guideline disagreement: “no concept”, conceptualized, and BioASQ (the details of these experiments appear later in Table 6). The first two were Word2Vec embedding models trained using PubMed articles as the training data. The third one used the pre-trained BioASQ word embeddings created for the BioASQ competitions [51] (http://BioASQ.org/news/BioASQ-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts (accessed on 24 February 2021)).
Our first model, trained on PubMed, included only words; no additional conceptual analysis with MeSH (https://www.nlm.nih.gov/mesh/meshhome.html (accessed on 24 February 2021)) was done. In the second, more complex model, n-grams matching MeSH terms were replaced with the corresponding concepts. For example, if breast and cancer appeared next to each other in the text, they were replaced with breast-neoplasms and treated as a single concept.
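The sketch below illustrates the kind of preprocessing involved: n-grams matching MeSH entry terms are collapsed into single concept tokens before training a Word2Vec model. The MeSH mapping and the training call are illustrative; the actual models were trained on PubMed articles, and the parameter names follow Gensim 4.x.

```python
from gensim.models import Word2Vec

# Illustrative mapping from MeSH entry terms (as token bigrams) to concept tokens.
mesh_bigrams = {("breast", "cancer"): "breast-neoplasms"}

def conceptualize(tokens):
    """Replace bigrams that match MeSH entry terms with a single concept token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in mesh_bigrams:
            out.append(mesh_bigrams[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(conceptualize(["screening", "for", "breast", "cancer"]))
# ['screening', 'for', 'breast-neoplasms']

# Training on a tokenized PubMed corpus (not included here); 200-dimensional
# vectors, as in the article.
# model = Word2Vec(conceptualized_sentences, vector_size=200, window=5, min_count=5)
```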
4.4. Our Best Model: Using BioASQ Embeddings and Word Mover’s Distance
Table 5 shows the (unnormalized) WM distances between the seven guidelines using BioASQ embeddings. Figure 5 shows side by side the geometries of the two graphs: one generated from the summaries of the full documents, using features derived from the CDC summaries, and the second one based on the machine-generated representations of the full guideline documents. To create Figure 5, for each metric, a diagram representing the distances between the nodes (guidelines) and a diagram with labeled edges were drawn using the Python networkx library (https://networkx.github.io/ (accessed on 24 February 2021)). All values were normalized to the same scale to allow visual comparison.
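A minimal networkx sketch of this visualization step; the three societies and distance values below are placeholders rather than the actual entries of Table 3 or Table 5.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Placeholder normalized distances between three societies.
distances = {("AAFP", "ACR"): 0.09, ("AAFP", "USPSTF"): 0.02, ("ACR", "USPSTF"): 0.08}

G = nx.Graph()
for (a, b), d in distances.items():
    G.add_edge(a, b, weight=d)

pos = nx.spring_layout(G, seed=0)                      # layout for plotting
nx.draw(G, pos, with_labels=True, node_color="lightgray")
nx.draw_networkx_edge_labels(                          # label edges with the distances
    G, pos, edge_labels={e: f"{d:.2f}" for e, d in distances.items()})
plt.show()
```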
The similarity is visible on visual inspection, and will be quantified in Section 6 to be about 70%. However, before we provide the details of the experiments, we will also answer two questions:
- How do we measure the distortion/similarity between the two graphs?
- Could this similarity of shapes be accidental? How do we measure such a probability?
5. Graph-Based Method for Comparing Collections of Documents
At this point we have created two graphs, one showing the distances between summary recommendations, and the other representing conceptual distances between documents. The procedure we used so far can be concisely expressed as Algorithm 1, where, given a set of documents, after specifying the Model (type of embeddings) and a distance metric, we get an adjacency matrix containing the distances between the nodes representing the documents. An example output of Algorithm 1 is shown in Figure 4 above.
What remains to be done is to quantify the difference in the shapes of these two graphs, and then to show that the similarity we observe is not accidental. The methods used in these two steps are described in Algorithms 2 and 3. The experiments and the details of the performed computations will be presented in Section 6.
Algorithm 1 Computing Graph of Distances Between Documents.
Input: Guidelines: a set of guideline documents in textual format; Model: a model to compute distances between two documents.
Output: M, the adjacency matrix of distances between guideline documents.
1: for each pair of documents (i, j) in Guidelines do
2:     Compute the distance between the two documents according to Model
3:     Put the distance in M[i, j]
4: end for
5: return M
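A direct Python rendering of Algorithm 1 could look as follows; the distance argument stands for any of the models/metrics discussed above (e.g., a wrapper around kv.wmdistance).

```python
import numpy as np

def distance_graph(documents, distance):
    """Algorithm 1 (sketch): pairwise distances between documents as an adjacency matrix."""
    n = len(documents)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = distance(documents[i], documents[j])
    return M
```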
We use a very natural, graph clique-based method for comparing the similarity of two collections of documents. Given two sets of documents, represented by graphs, and a one-to-one mapping between their nodes, Algorithm 2 computes the percentage distortion between the shapes of the two cliques; this is perhaps the most natural similarity measure (similarity = 1 − distortion) for comparing the shapes of two cliques of identical cardinality.
Algorithm 2 Distance or Percentage Distortion between Two Complete Graphs (cliques of the same size).
Input: Adjacency matrices M1 and M2 of equal dimensions.
Output: Graph distance/distortion d, as a value between 0 and 1.
1: Normalize the distances in M1 (by dividing each distance by the sum of the distances in the graph) to produce a new adjacency matrix N1
2: Normalize the distances in M2 to produce a new adjacency matrix N2
3: Set the value of d to 0
4: for each edge in N1 do
5:     Add the absolute value of the difference between the edge length and its counterpart in N2 to d
6: end for
7: return d
Note: for example, the distance between the two graphs in Figure 5 is 0.31, equivalent to 31% distortion.
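A sketch of Algorithm 2 in Python. We normalize over the upper triangle so that each undirected edge is counted once; the final halving, which keeps the value in [0, 1], is our assumption about where that bound comes from and may differ from the exact normalization used in the article.

```python
import numpy as np

def distortion(M1, M2):
    """Algorithm 2 (sketch): distortion between two cliques given as adjacency matrices."""
    M1, M2 = np.asarray(M1, dtype=float), np.asarray(M2, dtype=float)
    iu = np.triu_indices_from(M1, k=1)   # each undirected edge counted once
    n1 = M1[iu] / M1[iu].sum()           # normalize so the distances sum to 1
    n2 = M2[iu] / M2[iu].sum()
    # Halved so the result stays in [0, 1] (an assumption; see the note above).
    return np.abs(n1 - n2).sum() / 2
```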
Next, we need to compute the chance that the mapping arose from a random process. If the chances of the similarity arising from a random process are small, we can conclude that the conceptual vocabulary of a full document determines the type of recommendation given by a particular organization. In our case the nodes of both graphs have the same names (the names of the medical societies), but the shapes of the graphs differ, one coming from human summaries and comparisons (Figure 1, Table 1) and the other from machine-produced conceptual distances. Thus, the randomization can be viewed as a permutation of the nodes. When such permutations do not produce similar structures, we can conclude that the similarity of the two graphs in Figure 5 is not accidental.
Next, in Algorithm 3, we compute the average distortion, and the standard deviation of distortions, under permutation of nodes. The input consists of two cliques of the same cardinality. The distance measure comes from Algorithm 2.
Algorithm 3: Computing Graph Distortion Statistics.
Input: Normalized adjacency matrices N1 and N2 of equal dimensions (two cliques of the same cardinality).
Output: Baseline (average) graph distance and standard deviation of graph distances under permutations of the computed distances.
1: Set D to an empty list (we permute the labels of the graph, leaving the lengths of the edges intact)
2: for each permutation of the nodes of N2 do
3:     Compute the distortion d between N1 and the permuted N2 using Algorithm 2
4:     Append d to D
5: end for
6: Compute the mean and the standard deviation of D
7: return the mean and the standard deviation of D
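Algorithm 3 can be sketched as a permutation test over node relabelings, reusing the distortion function from the previous sketch; with seven guidelines there are 7! = 5040 permutations, so exhaustive enumeration is feasible.

```python
import numpy as np
from itertools import permutations

def distortion_statistics(M1, M2):
    """Algorithm 3 (sketch): mean and standard deviation of the distortion under all
    permutations of the node labels of the second clique (edge lengths kept intact)."""
    M1, M2 = np.asarray(M1, dtype=float), np.asarray(M2, dtype=float)
    n = M1.shape[0]
    values = []
    for perm in permutations(range(n)):
        p = list(perm)
        permuted = M2[np.ix_(p, p)]              # relabel the nodes of the second graph
        values.append(distortion(M1, permuted))  # distortion from the Algorithm 2 sketch
    return np.mean(values), np.std(values)
```

The observed distortion between the two graphs can then be compared against this baseline to judge whether the similarity could have arisen by chance.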
7. Discussion and Possible Extensions of This Work
Our broad research objective is to create a computational model accurately representing medical guidelines’ disagreements. Since the creation of such accurate models is beyond the current state of the art, in this article, we focused on an approximation, i.e., a model that is simple and general enough to be potentially applicable in other situations, and which was useful for the question at hand, namely, whether conceptual vocabulary determines recommendations.
As mentioned earlier, this article was partly motivated by the epistemology of disagreement, and more specifically by medical disagreement viewed as “near-peer” disagreement. Our results show that it is possible to build computational models of “near-peer” disagreement. Additionally, they provide support for the empirical observations of disagreement adjudication among medical experts [34,35], where the authors observe that differences in experts’ backgrounds increase the degree of disagreement.
A limitation of this article lies in testing the proposed method on a small number of case studies. In the main track, we focused on the CDC summaries of the breast cancer screening guidelines, and, in Appendix A, we discuss our experiments on the lower back pain management and hypertension guidelines. We showed that the method is robust in the case of these sample guidelines, because even with a change of metrics, the similarities remain statistically significant. However, this article only describes a few case studies, and leaves open the question of whether the method will work equally well in other cases. Thus, an obvious extension of this work would be to compare other groups of guidelines, e.g., European medical societies vs. US medical societies; we know that for years their recommendations, e.g., on managing blood cholesterol, have differed.
Another potential extension would be to experiment with other representations, such as more complex word and document embeddings, or with more subtle semantic representations based on entity and relationship extraction or formal models, cf. [52], and on formal modeling of contradictions, like the ones discussed in [5,6]. This, however, would introduce several complications. Attention-based models, such as BERT, GPT or Universal Encoder [53,54,55], would have to be used very carefully, since they encode the order of terms in documents, and indirectly the relations between words; therefore, they would not be appropriate for the experiments described in this article. More subtle formal models, on the other hand, are very brittle, and incapable of operating on real documents, with all the complications arising from interactions between sentences and the use of discourse constructions such as lists and tables. Perhaps one solution to this problem would be to represent the full text of each guideline document as a graph, and not as a bag of word embeddings. There is a vast amount of applicable empirical work on graph representations, including representations of recommendations (e.g., [56,57]) and various types of knowledge (e.g., [58,59]). The algorithms proposed in Section 5 would still be directly applicable; only the distances between pairs of documents would have to be modified and computed on the graph representations. These representations could vary, but in all cases we could use applicable algorithms for computing distances in graphs (e.g., [60]), similar to the word mover’s distance (WMD) used in this article. In addition, by experimenting with matching corresponding subgraphs, we could develop new distance measures.
Unlike in our earlier work [6], in this article we have not performed any logical analysis of the guidelines. We also did not use text mining to extract relations from the content of the guidelines, and although our focus was on concepts appearing in guidelines, we did not point to specific vocabulary differences. Instead, we measured semantic differences between guidelines using the distances between their vectorial representations. This has to do with the fact that, even though NLP methods have progressed enormously over the last decade [24], they are far from perfect. In our experiments, we used some of the simplest semantic representations: words and simple collocations represented as vectors in high-dimensional spaces. This simplicity is helpful, as we can run several experiments and compare the effects of using different representations and metrics. This gives us confidence that the similarities we are discovering tell us something interesting about guideline documents.
8. Conclusions
This article investigates the question of whether the disagreements in medical recommendations, for example in breast cancer screening or back pain management guidelines, can be attributed to the differences in concepts brought to the problem by specific medical societies (and not, e.g., to the style or formalization of the recommendations). Our experiments answered this question in the affirmative, and showed that a simple model using word embeddings to represent concepts can account for about 70% to 85% of the disagreements in the recommendations. Another contribution is the articulation of a very natural graph clique-based algorithm/method for comparing the similarity of two collections of documents: given two sets of documents, each of the same cardinality, and a mapping between nodes, we computed the percentage distortion between the shapes of the two cliques, and the chances that the mapping arose from a random process. We also documented all of the steps of the process and provided the data and the code (https://github.com/hematialam/Conceptual_Distances_Medical_Recommendations (accessed on 24 February 2021)) to facilitate both extensions of this work and its replication.
Our work extends the state of the art in computational analysis of medical guidelines: instead of semi-automated conceptual analysis, we demonstrated the feasibility of automated conceptual analysis. In our study, we used a representation derived from a (relatively shallow) neural network (BioASQ embeddings [51]) and knowledge-based annotations derived from MetaMap (https://metamap.nlm.nih.gov/ (accessed on 24 February 2021)). Our results, detailed in Section 6 and in Appendix A, show that both can be useful as representations of our set of guidelines; overall, they show similar performance in modeling conceptual similarities. However, the BioASQ_WMD model, using the BioASQ embeddings and the word mover’s distance, seems to be the most stable, as it performed very well in all of our experiments.
Although this article is a collection of three case studies, bound by a common method, it could be a good starting point for an analysis of other medical guidelines and perhaps other areas of expert disagreement. The methods described in this article are easy to use and rely on well-known tools such as word embeddings and MetaMap. They can also be extended and improved to produce more accurate and deeper analyses, due to the fast progress in text mining and deep learning techniques. From the point of view of methodology of analyzing medical guidelines, this article contains the first computational implementation of the “near-peer” model mentioned earlier. To our knowledge, ours is the first proposal to use automated methods of text analysis to investigate differences in recommendations.